ARI  TECHNICAL  REPORT 


TR-79-A9


Principles  of  Work  Sample  Testing: 
II.  Evaluation  of  Personnel  Testing  Programs 


Robert  M.  Guion 

BOWLING  GREEN  STATE  UNIVERSITY 
Bowling  Green,  Ohio  43403 


April  1979 


Contract DAHC19-77-C-0007



Prepared  for 


U.S.  ARMY  RESEARCH  INSTITUTE 

for  the  BEHAVIORAL  and  SOCIAL  SCIENCES 

5001 Eisenhower Avenue,
Alexandria, Virginia 22333

Approved for public release; distribution unlimited.




U.  S.  ARMY  RESEARCH  INSTITUTE 

FOR  THE  BEHAVIORAL  AND  SOCIAL  SCIENCES 

A Field  Operating  Agency  under  the  Jurisdiction  of  the 
Deputy  Chief  of  Staff  for  Personnel 

JOSEPH ZEIDNER, Technical Director

WILLIAM L. HAUSER, Colonel, US Army, Commander


NOTICES 


DISTRIBUTION: Primary distribution of this report has been made by ARI.
Please address correspondence concerning distribution of reports to:
U.S. Army Research Institute for the Behavioral and Social Sciences,
ATTN: PERI-P, 5001 Eisenhower Avenue, Alexandria, Virginia 22333.


FINAL DISPOSITION: This report may be destroyed when it is no longer
needed. Please do not return it to the U.S. Army Research Institute for
the Behavioral and Social Sciences.


NOTE: The findings in this report are not to be construed as an official
Department of the Army position, unless so designated by other authorized
documents.




UNCLASSIFIED
SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)

REPORT DOCUMENTATION PAGE
(READ INSTRUCTIONS BEFORE COMPLETING FORM)

1. REPORT NUMBER: Technical Report TR-79-A9
2. GOVT ACCESSION NO.:
3. RECIPIENT'S CATALOG NUMBER:
4. TITLE: PRINCIPLES OF WORK SAMPLE TESTING: II. EVALUATION OF PERSONNEL
   TESTING PROGRAMS
5. TYPE OF REPORT & PERIOD COVERED: Final, 15 Nov [year illegible] -
   15 [month and year illegible]
6. PERFORMING ORG. REPORT NUMBER:
7. AUTHOR(S): Robert M. Guion
8. CONTRACT OR GRANT NUMBER(S): DAHC19-77-C-0007


9. PERFORMING ORGANIZATION NAME AND ADDRESS: Bowling Green State
   University, Bowling Green, Ohio 43403
10. PROGRAM ELEMENT, PROJECT, TASK AREA & WORK UNIT NUMBERS:
11. CONTROLLING OFFICE NAME AND ADDRESS: US Army Research Institute for
    the Behavioral and Social Sciences, 5001 Eisenhower Avenue,
    Alexandria, Virginia 22333
12. REPORT DATE: April 1979
13. NUMBER OF PAGES: 62
14. MONITORING AGENCY NAME & ADDRESS:
15. SECURITY CLASS. (of this report): Unclassified
15a. DECLASSIFICATION/DOWNGRADING SCHEDULE:
16. DISTRIBUTION STATEMENT (of this Report): Approved for public release;
    distribution unlimited.


17. DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if
    different from Report):
18. SUPPLEMENTARY NOTES: Monitored by G. Gary Boycan, Engagement
    Simulation Technical Area, Army Research Institute.
19. KEY WORDS (Continue on reverse side if necessary and identify by block
    number): Measurement theory, psychometrics, work sample testing,
    validity, content-referenced testing, criterion-referenced testing,
    latent trait theory, generalizability theory


20. ABSTRACT (Continue on reverse side if necessary and identify by block
    number):

[illegible] are offered for increasing the objectivity of measurement in
programs of personnel testing. Classical concepts of reliability and validity
are reviewed. Construct validity is seen as the basic evaluation of a measuring
instrument in psychology; criterion-related validity actually refers to
hypotheses rather than to measurements, and content validity refers to test
development. The major evaluation for personnel tests is less a matter of
validity than of job relevance and of generalizability. Implications of latent
trait theory and generalizability theory are discussed in terms of
content-referenced



PRINCIPLES  OF  WORK  SAMPLE  TESTING:  II.  EVALUATION  OF  PERSONNEL  TESTING 
PROGRAMS 

BRIEF 

Personnel  testing  should  be  as  objective  as  possible.  Objectivity 
in  measurement  occurs  under  two  conditions:  if  the  scale  does  not 
depend  on  who  has  been  measured  with  it,  and  if  the  measures  do  not 
depend on the specific scale used. If the stimulus-response content
of the test permits verifiable responses, if the format imposes no
constraints on the responses, and if the responses are free from distortion,
the  principle  of  objectivity  is  approached. 

There are aspects of the test, however, other than its stimulus-response
content. Scoring procedures which are defined without respect
to the content may be attached to it; inferences are often drawn going
far beyond the content. The more objectively attributes can be measured,
the less reaching is needed to make appropriate inferences and,
therefore, the less elaborate the research needed to evaluate the
measurement.

Classical concepts of reliability and validity are reviewed.
Criterion-related validity is noted as concerned with inferences about
other variables rather than inferences about the measure used as a
predictor; criterion-related validity therefore does not evaluate the
measurement per se, although it evaluates hypotheses about
predictor-criterion relationships. Construct validity is seen as the
essence of validity, and it is defined in terms of the proportion of total
variance explainable by the construct being measured. The essence of
construct validity research is disconfirmatory; that is, it is intended to
consider alternative interpretations of the meaning of scores which, if
supported, would disconfirm the originally proposed inferences.

Content validity is not really validity at all; it is an evaluation
of the procedures of test construction, not of inferences drawn
from scores. In personnel testing, the test development procedure can
lead logically from definitions of a job content universe and domain
to the definition of a relevant test content domain and establishment
of test specifications; if the test is constructed according to those
specifications, its job relevance is virtually assured. Under certain
circumstances, the assurance of job relevance is all that is needed in
evaluating a personnel test.

Alternatives to classical psychometric theory are examined for
potential value in personnel testing, especially in work sample testing.
Work sample tests are seen as being, by definition, content-referenced
tests. Latent trait theory is examined for its implications for
scaling personnel tests, and the implications of generalizability
research are also considered.

It is concluded that too much attention is given to classical
concepts of validity and not enough to the more immediately important
evaluations of job relatedness and of generalizability beyond the
testing situation.




TABLE OF CONTENTS

INTRODUCTION
GENERAL CONSIDERATIONS IN THE EVALUATION OF TESTING PROGRAMS
THE TEST AS STIMULUS
THE TEST AS RESPONSE
THE TEST AS INFERENCE
THE TEST AS A TOOL FOR DECISION
CLASSICAL PSYCHOMETRIC THEORY: RELIABILITY
CLASSICAL PSYCHOMETRIC THEORY: VALIDITY
CRITERION-RELATED VALIDATION
CONSTRUCT VALIDITY
CONTENT VALIDITY
    Job Content Universe
    Job Content Domain
    Test Content Universe
    Test Content Domain
    The Limits of Content Sampling as Validity
ACCEPTANCE OF OPERATIONAL DEFINITIONS
    Intrinsic Validity
    Operationalism Based on Formal Structure
CHALLENGES TO CLASSICAL THEORY
CONTENT-REFERENCED MEASUREMENT
    Work Samples as Content-Referenced Tests
    Job Analysis
    Assembling Test Content
    Scaling Test Content
    Evaluating Content-Referenced Tests
LATENT TRAIT THEORY
    The Theoretical Foundations
    Uses of Latent Trait Analysis
    Evaluation
GENERALIZABILITY THEORY


LIST OF FIGURES

1 Schematic Diagram Showing Contribution to Objectivity of Three
  Response Dimensions (Adapted from Guion, 1965)
2 Venn Diagrams Relating Universes and Domains of Job and Test Content
3 Samples and Inferences in Work Sample Testing
4 Item Characteristic Curves of Three Hypothetical Items


INTRODUCTION 



The preceding paper in this series surveyed psychological measurement
in general. It concluded with the idea that different kinds of
measurement  of  different  kinds  of  variables,  and  perhaps  for  different 
purposes,  demand  a different  emphasis  in  evaluation.  This  paper  will 
also  be  quite  general,  although  with  explicit  references  to  the  central 
problem  of  work  sample  testing,  as  it  describes  important  considerations 
in  the  evaluation  of  personnel  testing  programs.  This  discussion  assumes 
that  personnel  testing  is  best  understood  as  taking  place  in  settings  of 
institutional  control,  even  if  actually  done  as  field  research,  and  that 
it  covers  the  gamut  of  variables  and  methods  of  measurement. 

GENERAL CONSIDERATIONS IN THE EVALUATION OF TESTING PROGRAMS

The emphasis is on the total evaluation of a total program; the evaluation
of a testing program includes but should not be limited to validation.
In some circumstances, conventional questions of validity may not
arise at all; where valid inferences from scores must be ascertained,
positive research results may be sufficient, but negative results leave
many unanswered questions. A total testing program consists of offering
a test under a standard circumstance as a stimulus, obtaining and scoring
responses, and drawing inferences from the scores for the sake of making
personnel decisions. Only the last of these directly uses classical
validation procedures.

THE TEST AS STIMULUS

The test content, instructions, administrative procedures, format,
and the situation in which the test is administered all contribute to a
stimulus complex which should be standardized; evaluation of a testing
program should inquire first into the details of its standardization.
Are instructions given according to clearly standard procedures? If so,
are they uniformly understood by all examinees before the test begins?
If there are time limits or other constraints on performance, are they
rigidly standardized and enforced, as they should be?

The principle basic to these and virtually all other questions in
evaluation is straightforward: does a person's score on the test represent
clearly the attribute being measured, or do other attributes of the
person, the test, the procedure, or the setting in which the testing is
done have some influence on the score? To the extent that irrelevant
attributes influence obtained scores, a testing program is in some sense
deficient.

The second line of inquiry concerns the degree to which the content,
format, structure, and technique contribute to the objectivity of
measurement. The term objectivity has been indiscriminately applied in
psychological measurement without much precision of meaning; when people
refer to objective tests, they frequently mean multiple-choice tests. It is
true that such tests may be objectively scored, but the measure becomes less
than objective to the extent that the available options either constrain
or suggest the responses of the examinee.

The most objective measurement is mathematically formal and is best
illustrated by physical measurements. Two characteristics of such
measurement make it genuinely objective: (a) the scale exists independently
of the objects used in developing it, and (b) the measurement is independent
of the particular instrument used for measuring. In presenting these
requirements for objectivity, Wright said by way of illustration, "But
when a man says he is five feet eleven inches tall, do we ask to see his
yardstick?" (Wright, 1968, p. 87).

By these standards, traditional psychological testing is never objective.
The measures (scores) depend on the particular sets of questions
asked and on the sample of people (objects) used in item analysis, and
the meaning depends on the sample used in establishing norms. By analogy,
however, some characteristics of objective measurement can be approximated
in even traditional psychological testing. In objective measurement of
the length of objects, for example, the measure can be verified, the
instrument imposes no constraints on the results of measurement (save in the
fineness of calibration), and the object itself cannot distort the
measurement. By analogy, since psychological measurement is based on
responses, a test is objective to the extent that its content permits
responses that can be verified, that it places no constraints on the nature
of the responses, and that the responses are undistorted. The reference to
the multiple-choice format as "objective" suggests a further analogy in that
the observer who reads the yardstick or who scores the test should not be
able to distort the results.

The first contribution of the test as stimulus to objectivity is its
content. A test of arithmetic skill problems can be far more objective
than a measure calling for endorsements of statements of belief, partly
because it is less ambiguous. In any content area, the objectivity of
measurement can be enhanced if the content domain to be sampled is clearly
defined and the procedures for sampling clearly specified. Clarity in
defining the domain and the procedures for sampling is insurance against
ambiguity. Ambiguity of content may distort responses, and it will surely
lead to unreliable inferences from them.

In evaluating the test as stimulus, no special constraints need to
be placed on the nature of the domain to be defined. It may be a performance
domain, a domain of factual information, or a domain of approaches
to measurement. For example, one may set out to construct a test of
problem-solving ability. Literally dozens of problem-solving tasks may
be used. One might use block designs, small assembly tasks, manipulative
tasks, exercises in logical reasoning, and countless others. The domain
of problem-solving is not a very unified domain. If one wanted to sample
for the problem-solving test all possible kinds of problem-solving tasks,
the result would be not only an incredibly long test but one that would
lack internal consistency, an important evaluative consideration. The
domain, therefore, might be defined specifically in terms of the use of
anagrams. Within this domain, boundaries can be established and the
characteristic tasks available for sampling can be identified. One might
specify that the domain will consist of seven-letter anagrams containing
from two to five vowels. Other specifications can be added. With the domain
clearly defined, a test constructor can establish rules for sampling the
domain. The rules may specify only procedures for sampling the content
domain, or they may specify statistical rules for accepting items sampled.
For example, if the test is to be used for conventional norm-referenced
interpretations, it may be specified that item difficulties are to be
within a given range and that all item-total correlations must be above
some minimum value. The result of clear specifications of the test domain
should be clearer meaning of scores.
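
The statistical acceptance rules just mentioned can be illustrated with a
short sketch. It is offered only as one possible reading, under stated
assumptions: items are scored 0 or 1, item difficulty is taken as the
proportion of examinees answering correctly, and the function names, the
difficulty range of .30 to .80, and the minimum item-total correlation of
.20 are hypothetical values chosen for illustration, not figures from this
report.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either variable has no variance."""
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    mx, my = mean(x), mean(y)
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (sx * sy)


def item_statistics(responses):
    """responses: one list of 0/1 item scores per examinee.

    Returns a (difficulty, item-total correlation) pair for each item.
    """
    totals = [sum(r) for r in responses]
    stats = []
    for j in range(len(responses[0])):
        item = [r[j] for r in responses]
        stats.append((mean(item), pearson(item, totals)))
    return stats


def select_items(responses, p_range=(0.30, 0.80), min_r=0.20):
    """Indices of items meeting the assumed (hypothetical) acceptance rules."""
    low, high = p_range
    return [j for j, (p, r) in enumerate(item_statistics(responses))
            if low <= p <= high and r >= min_r]


# Illustrative data only: six examinees, four items.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 1],
]
print(select_items(responses))  # -> [1, 2]
```

In this made-up data set the first item is answered correctly by everyone
and so falls outside the difficulty range, and the last item correlates
negatively with the total score; only the two middle items would be accepted
under the assumed rules.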

These points should not be overemphasized in an evaluation. An
excellent  testing  program  may  be  based  on  serendipitous  findings  using 
haphazardly  constructed  tests.  Nevertheless,  the  final  overall  evaluation 
of  the  testing  program  is  more  likely  to  be  favorable  if  the  stimulus 
properties  of  the  program  have  been  carefully  constructed. 

THE  TEST  AS  RESPONSE 

The content of a test, and therefore the content of the domain sampled,
is a stimulus-response content. The test is not the printed instructions
or questions or assigned tasks; it is the combination of instructions and
success in following them, questions and answers, or tasks and performance.
Objectivity of traditional measurement from responses to stimuli, already
discussed, is illustrated in Figure 1.

If the response options for the examinee are wholly open-ended, the
testing content is defined in part by the content analysis of the responses
and the resulting scoring categories. Most performance tests involve
open-ended responses. A work sample test, for example, consists of telling
the examinee to do something. The actual responses (that is, the performance)
may be observed and classified, or the consequences of the response
(that is, the product of the behavior) may be evaluated along selected
dimensions. The measurement is, however, objective in format, if not in
scoring, because of the unrestricted opportunities for response. Objectivity
may suffer (because of the necessity to classify responses), however,
if the procedure permits observer or scorer characteristics to influence
an obtained score intended to be a measure of an attribute of the examinee.
Maximum objectivity in measurement may require an optimal tradeoff
between the distortion created by artificially restricting responses and
the distortion created by the unreliability or bias of observers'
classification and scoring procedures.

If the form of the test is in a restricted response mode, such as
multiple-choice, the definition of the test domain should include possible
or plausible responses. It is a useful practice, and an indication of
great care, to begin the construction of a multiple-choice test by
administering the items in open-ended form to a substantial sample of people.
The responses used to complete an item stem can be tallied, and a domain
of potential responses can be identified with rules for selecting the
correct and distracting options.

Responses to test items or tasks must yield scores if there is to be
any measurement. This is one of those underwhelming, obvious kinds of
statements that often seems to be overlooked. The point is that the
stimulus-response content does not often include the score. In a
multiple-choice test, the traditional scoring is simply a count of the
number of items answered correctly, but this is a traditional convenience,
not dictated by the content domain. In other forms of examination, such as
work samples, the scoring procedure may have to be invented. In either
case, the scoring procedure needs to be evaluated both for classical
reliability and for the possibility of contamination in the scores.


THE  TEST  AS  INFERENCE 


This  heading  includes  the  traditional  concern  for  validity  in  the 
evaluation  of  testing.  The  topic  will  be  examined  in  more  detail  in  a 
subsequent  section;  it  is  sufficient  here  to  indicate  the  possible  variety 
of  inferences  and  the  evaluative  questions  of  validity  they  pose. 

One form of inference, according to the APA Standards (APA, AERA, and
NCME, 1974), is the inference of performance in a domain based on
performance on the sample. Evaluation of this sort of inference has been
based on the ambiguous notion of content validity.

A different  type  of  inference  involves  inferring  performance  on  one 
measure from performance on a different one. Evaluation of such
inferences is based on criterion-related validity.

In the third class of inferences, an individual's standing on some
underlying  characteristic  presumably  measured  by  the  test  is  inferred  from 
the  score.  Evaluation  of  such  inferences  is  based  on  construct  validity. 

Different inferences are sought for different purposes, and therefore
the emphasis on evaluating the validity of inferences is likely to differ
in different testing situations. Nevertheless, it should be understood
that virtually all mental measurement involves to some degree all three
kinds of inference. On the anagrams test of problem-solving discussed
earlier, to evaluate the inferences as valid, the evaluator must be willing
to infer that performance on that set of anagrams is a good indicator of
performance on any other set of anagrams from the same specified domain;
he must be willing to infer that performance on the anagrams task is
related to performance on something else of particular interest to the
evaluator; and he must be able to infer that the performance on the
anagrams fits a network of relationships in which the scores can be
interpreted in terms of problem-solving ability rather than in terms of some
other characteristic such as verbal comprehension.

THE TEST AS A TOOL FOR DECISION

Most personnel testing is done to provide a basis for decisions, not
primarily to measure an attribute. Evaluation of the test as a measuring
instrument is important to its evaluation as a decision tool, but the two
evaluations should not be confused. An excellent measure may be a poor
basis for decision; a poor measure may nevertheless be the best decision
tool available.

Decisions are based on predictions, either literal or implied. In
personnel testing, therefore, the usual and primary evaluation of a testing
program lies in the magnitude of the correlation between scores on
the test and subsequent measures of the variable to be predicted. Even
in situations where it is either infeasible or unnecessary to compute
such a correlation coefficient, the logic of trying to maximize an
implied predictive relationship remains the paramount basis for evaluating
decision tools.

A principal implication of that logic is the general rule that complex
performance can be predicted better with a set of predictors than
with any one test. In most practical personnel prediction problems, a
test battery will be devised, and some form of composite score will be
computed for each person. In all discussions of test scores that follow,
this composite score is as relevant as a score on a single test.

Multivariate prediction does not necessarily or uniformly imply a
composite. The different variables might be arranged in some sort of
sequence of decisions. Where this procedure is followed, one evaluates
the testing program, and any particular test or composite within it,
in the light of its position within the sequence. Its position in the
sequence becomes another aspect of the setting in which the test is given.



The decision to be made is not an automatic consequence of the prediction.
A cutting score may be set, above which individuals are selected
or certified (or whatever), and it may fluctuate from time to time according
to changing standards or to supply and demand. Subjective considerations
may influence decisions independently of test scores. In some
settings, variables that may have influenced obtained test scores may be
considered by applying a mathematical correction or some sort of subjective
fudge factor. Whether the decision is based solely on test scores, or
whether other considerations influence the decision, the decision itself
is the final step in the testing process to be evaluated. According to
modern decision theory, the evaluation should be based on concepts of
utility and cost effectiveness. A comparison of the costs of Type I and
Type II errors should be made in evaluating the utility of the decisions.
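
The comparison of error costs can be made concrete with a small sketch. It
is only an illustration under assumptions that are not in the report: each
record pairs a test score with a later success-or-failure outcome, a person
is accepted when the score is at or above the cutting score, and the two
kinds of error are labeled "false acceptance" and "false rejection" with
hypothetical unit costs. Which of these corresponds to a Type I or a Type II
error depends on how the decision is framed, so the sketch avoids those
labels.

```python
def decision_cost(records, cut, cost_false_accept, cost_false_reject):
    """Total cost of the errors made by accepting everyone scoring >= cut.

    records: iterable of (score, succeeded) pairs, where succeeded is a bool.
    """
    total = 0.0
    for score, succeeded in records:
        accepted = score >= cut
        if accepted and not succeeded:
            total += cost_false_accept   # accepted someone who later failed
        elif not accepted and succeeded:
            total += cost_false_reject   # rejected someone who would succeed
    return total


def best_cut(records, cuts, cost_false_accept, cost_false_reject):
    """Candidate cutting score with the lowest total error cost."""
    return min(cuts, key=lambda c: decision_cost(
        records, c, cost_false_accept, cost_false_reject))


# Illustrative data only: (test score, later success on the job).
records = [(40, False), (50, False), (55, True), (60, False),
           (65, True), (70, True), (80, True)]
cuts = [45, 55, 65]

# When a false acceptance is costly, the higher cutting score is preferred.
print(best_cut(records, cuts, cost_false_accept=3, cost_false_reject=1))  # -> 65
# When a false rejection is costly, a lower cutting score is preferred.
print(best_cut(records, cuts, cost_false_accept=1, cost_false_reject=5))  # -> 55
```

The same scores and outcomes lead to different preferred cutting scores once
the relative costs change, which is the sense in which the decision, and not
only the measurement, has to be evaluated.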

Although the logic of prediction is almost always implied in any
personnel decision, the arithmetic of prediction may be superfluous. The
logical prediction frequently made, particularly when testing for a
particular skill, is that a high-scoring person will perform better by
using that skill if placed in a job or a training program that demands
it. For example, it is almost an unarguable proposition that a person
who scores high on a test of typing skill will be able to handle the
typing assignments of the ordinary office. There is neither any need to
compute a criterion-related correlation coefficient nor much
desire for it, since the extraneous factors that might inhibit performance
(such as immediate conflict with the supervisor) are of little interest
in the evaluation of the testing program. In these kinds of situations,
the test score is interpreted on its own terms. One who types more words
per minute is assumed to be able to type more words per minute than
someone else. The score is its own operational definition of a skill
that is prerequisite to successful performance on a job.



If the conditions of performance on the job are substantially different
from the conditions of performance in the testing situation, a question
of generalizability arises. To use an absurd but descriptive example,
individual differences in a standardized typing test might have very little
relationship to individual differences in performing the same typing task
in a pitching rowboat.

The example is an extreme example of the problem of the generalizability
of test scores. A permissible inference under one set of conditions
may not be permissible under a quite different set of conditions. A
major evaluation for many decisions is whether the generalizability of
performance in the test situation to performance in the targeted conditions
is a reasonable assumption. If there are to be dramatic differences
in conditions, then empirical verification of generalizability yields
important information.

The issue of fairness, which has been central in most discussions of
personnel testing since the passage of the 1964 Civil Rights Act, should
be understood as a special case of generalizability. In the situation
where tests are evaluated with criterion-related correlation coefficients,
the issue is one of the generalizability of the regression equation; do
the constants computed for a composite of all groups apply equally well
to any identifiable subgroups? In tests which are evaluated without
such correlation coefficients, it may be more important to identify and
evaluate the magnitude of various sources of error. Is a measure of
performance on a work sample, for example, influenced by an observer's
knowledge of the race or sex of the examinee? If so, to what extent?
Is the task so organized that persons of unusual height have a handicap
in the test situation that would not influence performance under more
realistic conditions; that is, has the standardization of the test
created an artificiality that influences the scores of some people
unfairly because it does not exist in other conditions?
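
The question about regression constants raised above can be sketched as a
simple check. The following is an illustration only, assuming a single
predictor, an ordinary least-squares line fitted to the pooled groups, and
hypothetical function names; it is not a procedure prescribed in this
report. If the pooled constants apply equally well to a subgroup, the
average difference between actual and predicted criterion values within
that subgroup should be near zero.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope and intercept for predicting ys from xs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def mean_prediction_error(xs, ys, slope, intercept):
    """Average of (actual - predicted); values far from zero indicate that
    the equation systematically under- or over-predicts for this group."""
    return mean([y - (slope * x + intercept) for x, y in zip(xs, ys)])


# Illustrative data only: test scores (x) and criterion measures (y).
group_a_x, group_a_y = [1, 2, 3], [2, 4, 6]
group_b_x, group_b_y = [1, 2, 3], [3, 5, 7]

# Constants computed for the composite of both groups.
slope, intercept = fit_line(group_a_x + group_b_x, group_a_y + group_b_y)

print(mean_prediction_error(group_a_x, group_a_y, slope, intercept))  # -> -0.5
print(mean_prediction_error(group_b_x, group_b_y, slope, intercept))  # -> 0.5
```

In this made-up case the pooled equation over-predicts for one group and
under-predicts for the other by the same amount, so the constants do not
generalize to either subgroup even though the slope is the same in both.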


CLASSICAL  PSYCHOMETRIC  THEORY:  RELIABILITY 

It has been said that all measurement, regardless of method or attribute
measured, must be reliable. It does not follow, however, that the
identical ways of examining reliability apply in all cases.

The  essence  of  an  investigation  of  reliability  is  an  assessment  of 
the  degree  to  which  variance  in  a set  of  measurements  may  be  attributed 
to error. Different kinds of variables, and different methods of
measurement, are susceptible to different sources of error. Moreover,
different methods of estimating reliability are sensitive to different
sources of error (Stanley, 1971).

With many forms of measurement, one is especially interested in the
stability of the description over time, that is, in errors due to instability
of measurement. However, error over time is not relevant to all
measurements. Blood pressure, for example, is not stable over time; it
varies according to activity level, tension, etc. Yet failure to find
the same blood pressure under these different conditions would never be
considered to be an error of measurement; it is simply a valid reflection
of the changes that occur in the attribute being measured. On the other
hand, measures that are supposed to represent relatively enduring
traits, such as behavioral habits or personality characteristics, should
stay rather constant, at least in reasonably similar conditions, over
some substantial period of time. Variations over time in measurement in
these cases constitute error. Changes in obtained measurement over
time are likely to be considered sources of error for measures of personality
traits, cognitive skills, motor skills, job knowledge, and most
measures of performance or proficiency. It is probably inappropriate to
treat change over time as a source of error in measurement for most
physical or attitudinal variables.




A general principle in measurement is that one should measure one
attribute at a time. The standard way of measuring, exemplified by
typical tests, is to use many fallible operations to measure the same
thing and accumulate observations. Thus a test will consist of many items,
each with different specific content but each presumably tapping or
reflecting the same fundamental attribute; that is, each is a miniature
test. The total test is a sample of observations from a homogeneous
universe. Behavioral statements in rating scales constitute a similar
example, as, perhaps, do pieces of information obtained through records.
Obviously, observation of behavior over time can likewise be divided into
"items." In short, tests can often be said to consist of component parts,
each of which constitutes an independent observation of the same variable.
If, however, one of these components proves to reflect an attribute other
than that being measured, the inclusion of that component in the total
leads to an error of measurement. An item that measures something
different from the rest of the items is a contaminating item. An observation
taken during a time period in which a sharp noise or other distraction
occurs is a contaminating observation, since it reflects behavior under
distraction rather than behavior under attention to the task at hand. It
is conventional to refer to studies of errors in sampling the observations
as estimates of internal consistency or homogeneity in measurement.
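
As an illustration of how a contaminating item lowers internal
consistency, the following is a minimal sketch in Python. It assumes
coefficient alpha as the index; the report speaks only of internal
consistency coefficients in general, so the choice of index, the function
name, and the small data sets are hypothetical.

```python
from statistics import pvariance


def coefficient_alpha(scores):
    """Coefficient alpha for a persons-by-items table of scores.

    scores: list of persons, each a list of k item scores (k >= 2).
    The total-score variance must be nonzero.
    """
    k = len(scores[0])
    # Variance of each item across persons.
    item_vars = [pvariance([person[i] for person in scores]) for i in range(k)]
    # Variance of the total (summed) score across persons.
    total_var = pvariance([sum(person) for person in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical data: three items that rank four persons identically.
homogeneous = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]
# Same data, but the third item now reflects some other attribute.
contaminated = [[1, 1, 4], [2, 2, 1], [3, 3, 3], [4, 4, 2]]

print(coefficient_alpha(homogeneous))
print(coefficient_alpha(contaminated))
```

The second value is markedly lower than the first, which is the sense in
which one contaminating item introduces error into the total score.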

Homogeneity should be a pervasive concern in all measurement, and
it is virtually assured in fundamental measurement. Tests, on the other
hand, represent very small samples of nearly infinite populations of
possible items and are especially susceptible to such sampling errors.
To investigate these errors, it is common practice to develop and compare
parallel forms. As a matter of fact, classical reliability theory assumes
parallel test forms meeting rather stringent definitions. Because it is
unlikely that non-test approaches to measurement will meet the
requirements for parallel forms, domain sampling error in these methods may
be substantially larger.


An analogous reliability problem occurs when substantially different
(i.e., non-parallel) methods are used to measure the same attribute.
It probably matters little whether one measures the length of a board
with a flexible steel tape measure or a wooden yardstick, but in some
areas of physical measurement the alternative approaches are anything but
parallel. Measuring distances in cartographic analysis by triangulation
is in no sense parallel to measuring with a ruler. If the two methods
give different results, one or both of them may be wrong; a simple
correlation to demonstrate that the methods are inconsistent is not a
sufficient basis for assigning error to either one.

If  observers  are  the  instruments  of  measurement,  there  is  probably 
a finite number (greater than one or two) of possible observers. The one
or  two  actual  observers  used  are  samples  from  the  universe  of  possible 
observers  and,  as  such,  may  be  sources  of  measurement  error. 

If two observers are used, each may contribute a unique error of
measurement, and it is necessary to assess the degree to which any
composite measure is subject to error in sampling observers. There is no
way to estimate the error due to sampling observers if only one observer
is used, just as there is no way to ascertain error attributable to a
specific set of questions if only one form of the test is used. Repeated
samples of observers (or tests) are required to estimate the degree to
which the sampling introduces error into the measurement. Likewise, if
ratings are used, some error may be due to the raters chosen, and it can
only be evaluated by determining the degree of agreement between raters.
Another example calls for estimating agreement among scorers of
open-ended test items.

Many forms of measurement involve a subjective assignment of people
or objects to scaled categories. An attempt to measure aggressive
tendencies under conditions of provocation in an assessment center, for
example, might present an assessee with an anger-producing situation, and
observers may be instructed to determine whether the response fits better
in a category described as turning white and silent, or a category
described as verbal expressions of anger, or a category described as using
physical movements symbolic of attack. Such measurement poses two quite
different kinds of reliability problems. One is the degree to which
observers may agree on their observation of behavior; the other is the
degree to which the numbers assigned to the categories fall along a
reproducible scale. (The point of view taken here is that the Guttman
index of reproducibility is a special case of reliability.)
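
The first of these two problems, agreement between observers assigning
responses to categories, can be sketched in Python as follows. The sketch
assumes Cohen's kappa as the chance-corrected index of agreement; the
report does not name a particular index, and the function names and the
category labels are hypothetical shorthand for the three categories
described above.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of assessees placed in the same category by both raters."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement corrected for chance agreement.

    Undefined (division by zero) if both raters use a single, identical
    category for every assessee.
    """
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    # Agreement expected if the raters assigned categories independently.
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical categorizations of six assessees by two observers.
observer_1 = ["silent", "verbal", "verbal", "physical", "silent", "verbal"]
observer_2 = ["silent", "verbal", "physical", "physical", "silent", "silent"]

print(percent_agreement(observer_1, observer_2))
print(cohens_kappa(observer_1, observer_2))
```

The second problem, whether the category numbers fall along a
reproducible scale, is not addressed by this sketch.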

Still another potential source of error can be broadly identified
as a condition of measurement. The results of measurement may be
different if the measurement is taken in the morning or in the late evening;
they may be different if it is taken under sanitary, optimal conditions
rather than in less pleasant but more realistic field conditions.
Knowledge of the extent of such errors may often be useful, even if not
often available.

Gross estimates of most of these sources of error can be obtained
by conventional methods of estimating reliability: internal consistency
coefficients, coefficients of equivalence, coefficients of stability, or
coefficients of agreement (conspect reliability). However, these
coefficients simply do not do a particularly clean job of separating out
the components of error associated with attributes of the person, method,
or setting other than the attribute measured. If one computes an
internal consistency coefficient, a stability coefficient, and an
equivalence coefficient, determines the proportion of variance attributable
to each of the three kinds of error implied by those coefficients, and adds
them up, the total estimate of error variance obtained is far greater
than is realistic. A preferred approach is the generalizability analysis,
or multiple facet analysis, advocated by Cronbach, Gleser, Nanda, and
Rajaratnam (1972). Using analysis of variance designs, such analysis can
examine the components of error most likely to be problems in a specified
testing program.


Generalizability studies seem especially useful for estimating errors
in evaluating performance on a work sample. Such performance may be a
function of the instructions given, the person who administers or scores
the test, the activities preceding testing, and the environmental setting
in which performance is being measured. One may evaluate these sources
of potential error, with explicit estimates of the proportionate total
variance attributable to each source, through the use of analysis of
variance designs.
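
A minimal sketch of such an analysis, in Python, is given below for the
simplest possible design: every person is scored once by every scorer
(a fully crossed persons-by-scorers table). The design, the function
names, and the scores are assumptions made for illustration; a realistic
work sample study would add facets such as instructions or setting, and
the estimation formulas shown are the usual expected-mean-square
solutions for a two-way layout without replication.

```python
def variance_components(table):
    """Estimate variance components for a persons-by-scorers table.

    table[p][s] is the score given to person p by scorer s; at least two
    persons and two scorers are required.
    """
    n_p, n_s = len(table), len(table[0])
    grand = sum(map(sum, table)) / (n_p * n_s)
    p_means = [sum(row) / n_s for row in table]
    s_means = [sum(row[s] for row in table) / n_p for s in range(n_s)]

    # Mean squares from the two-way analysis of variance.
    ms_person = n_s * sum((m - grand) ** 2 for m in p_means) / (n_p - 1)
    ms_scorer = n_p * sum((m - grand) ** 2 for m in s_means) / (n_s - 1)
    ss_resid = sum(
        (table[p][s] - p_means[p] - s_means[s] + grand) ** 2
        for p in range(n_p)
        for s in range(n_s)
    )
    ms_resid = ss_resid / ((n_p - 1) * (n_s - 1))

    # Negative estimates are set to zero, a common convention.
    return {
        "person": max((ms_person - ms_resid) / n_s, 0.0),
        "scorer": max((ms_scorer - ms_resid) / n_p, 0.0),
        "residual": ms_resid,
    }


def proportions(components):
    """Express each component as a proportion of their (nonzero) sum."""
    total = sum(components.values())
    return {name: value / total for name, value in components.items()}


# Hypothetical work sample scores: four persons, three scorers.
scores = [[7, 8, 6], [5, 6, 5], [9, 9, 7], [4, 6, 4]]
print(proportions(variance_components(scores)))
```

The "person" proportion is the variance of interest; the "scorer" and
"residual" proportions are explicit estimates of two sources of error.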

CLASSICAL PSYCHOMETRIC THEORY: VALIDITY

Validity refers to an evaluation of the quality of inferences drawn
from test scores, qualitative summaries, judgments, or other measurements.
The first point of importance in that statement is that validity is not
a fact; it is an evaluation. Moreover, it is a quantitative evaluation.
It is best to think of validity as expressible only in broad categories:
high validity, satisfactory validity, or poor or no validity. Depending
on the context, one may compare validities and say that validity in one
circumstance is better than, equal to, or worse than validity in another.
Since such statements do not denote precise quantities, they are not
expressible with precise numbers. One should not confuse an evaluative
interpretation of validity with an obtained validity coefficient.

Validity  is  not  measured;  it  is  inferred.  Although  validity  coefficients 
may  be  computed,  the  inference  of  validity  is  based  on  such  coefficients, 
not  equated  with  them. 

There has been a kind of colloquial shorthand in psychometric English
in which people tend to speak of the "validity of a test." Informed
people do not mean that phrase literally. It is simply a shorthand
phrase referring to the evaluation of the inferences one draws from
scores obtained on a test. (In this paper as in others, the shorter
phrase will undoubtedly be used frequently, but there should be no
misunderstanding about its meaning.) Speaking precisely, validity refers to
evaluations of specific inferences that may be drawn from scores, not to
evaluations of properties of tests, and there are as many validities
as there are inferences to be drawn from the scores. In evaluating the
total testing program, many test properties should be evaluated, such as
degree of standardization, adequacy of content sampling, and the like.
These properties may contribute to one's evaluation of the validity of
certain inferences from scores, but they should not be confused with such
inferences.

Validation refers to the processes of investigation from which the
validity of certain inferences from scores may itself be inferred or
evaluated. All validation procedures are in some sense empirical. Some
of these procedures involve correlating test scores with other data,
comparing correlations, doing experimental studies to determine differences
in scores for groups differing in attributes or treatments, or examining
evidence of the procedures used in the construction of a test. Where the
evidence of validity is drawn from correlations of scores with other
measures, the validation does not consist simply of computing the
correlation coefficient; it consists of the entire research process,
including sampling of persons and of situations, the evaluations of other
forms of validities of the measures used, the evaluation of the logic of
the hypothesized relationship between the variables, and the purely
procedural care with which data were collected. These are also empirical
events, and the argument for interpreting scores on the one test as
permitting valid inferences about the variable measured by the other one is
supported (if at all) by the entire chain of empirical evidence, not just
the correlation coefficient.




To say that certain kinds of inferences from scores are valid
inferences, therefore, implies not only the empirical process of gathering
data but the logical process of evaluating all of the available evidence.

Implied in the foregoing is a final observation about the nature of
validity: in classical psychometric theory, validity refers to a set of
scores. The evidence upon which validity may be claimed applies to the
score of a single individual only if that score can be interpreted with
reference to an entire set of scores. That is, in classical
interpretation of scores, the individual score is considered more or less
valid only if it has been previously determined that a set of scores from
other individuals tested in the same way is a more or less valid set of
scores. Validity is therefore defined in terms of variances; validity is
the proportion of total variance relevant to the purposes of testing;
irrelevant sources of variance reduce validity. A correlation coefficient
describing the relationship of one measure to another is simply a means
of describing the shared variance.

In short, to make judgments about the validity of the inferences one
may draw from a set of scores is to make judgments about the irrelevant
components in a set of scores. Earlier discussions referred to evaluations
of single scores as the degree to which a score is free from reflections
of attributes other than the one intended. The classical way to ascertain
that freedom is to determine the level of irrelevant sources of variance.
This discussion of validity in general, therefore, has reflected, without
explicitly referring to them, the aspects of validity identified in the
Standards for Educational and Psychological Tests (APA et al., 1974).
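
The statement that a correlation coefficient describes shared variance
can be made concrete with a short Python sketch. The squared
product-moment correlation is taken here as the proportion of variance
two sets of scores share under a linear relation; the function name and
the scores are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical test scores and scores on another measure.
test_scores = [1, 2, 3, 4]
other_measure = [1, 3, 2, 4]

r = pearson_r(test_scores, other_measure)
print(r, r ** 2)  # correlation and proportion of shared variance
```

As the text stresses, such a number supports an inference of validity;
it is not itself the validity of the inference.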

CRITERION-RELATED VALIDATION

At the most directly empirical level are the criterion-related
validities, predictive and concurrent. For convenience, the many reasons
for conducting criterion-related validity studies can be set in two
categories: (a) to investigate the meaning that may be attached to scores
on a test, that is, to identify more clearly the variable or variables
measured, and (b) to investigate the utility of the scores as indicators
or predictors of other variables.

The first of these grows out of the historical definition of validity
as the extent to which a test measures what it "purports" to measure.
If one has developed a test "purporting" to measure scholastic aptitude,
then the "real" measure of aptitude is how well one does in school (Hull,
1928). School performance is then the criterion of how good the test is.
That is, the correlation between scores on the test and grades in school
is an index of the success of the test in measuring what it was supposed
to measure. The same logic is sometimes found in modern instances in
which a test of, let us say, verbal ability is correlated against
supervisory ratings of verbal ability.

This kind of validation, although it involves computing a correlation
between scores on the test being validated and another measure called a
criterion, is better discussed under the heading of construct validity.
That is, in the more conventional language of the last quarter century,
such criterion-related studies are done for the purpose of verifying the
interpretation of scores in terms of designated constructs.

It is an obvious outgrowth of concern for criterion-related validity
that one finds that the criterion of "real" aptitude is often a variable
of great importance, and the utility of the test as a predictor
of that criterion becomes a matter of greater interest than the
theoretical interpretation of the scores themselves. Common
practice uses the term criterion-related validity primarily for
those situations where one wishes to infer from a test score an
individual's standing on some variable of interest that is different from
the variable measured by the test. The latter variable has been called a
criterion for historical reasons, but it is usually better described as
a variable analogous to the independent variable in experimental studies.

The analogy is useful because, in criterion-related validities, the
inference is based upon a hypothesis. That is, on a priori grounds, the
test user or test developer hypothesizes that performance on the test is
related to performance on some other measure, often of a different
variable. Validation in such cases is less a matter of checking an intrinsic
interpretation of test scores than of conducting research on the hypothesis.

In the field of personnel testing, at least for selection, the
hypothesis takes the form that scores on the test can be used as indicators
of potential proficiency, or some other performance variable, on a job.
For example, on a given production job where each spoiled piece represents
a monetary loss to the employer, scrap rate is a fundamental measure of
an economic variable. With some validation, one might draw inferences
about psychological variables from scrap rate (clumsiness or carelessness
are competing interpretations), but this is usually not the salient point.
The point is that each spoiled piece costs the organization money. If it
can be shown that a particular dexterity test, or perhaps a particular
test of knowledge, can predict individual scrap rates within reasonable
limits of error, then the scores on the tests may be used to "infer"
(more accurately, to predict) scrap rates on the job, even though the
individual has not yet been trained or placed on the job. The fact that
a theoretician can find an explanation for the common variance in the
two sets of measurements is relatively trivial in most cases; rarely is
there any attempt to interpret such criterion-related validity coefficients
theoretically: what is interpreted is the value of the test as a basis
for predictions of future performance. What is commonly called test
validation is, therefore, best understood as an investigation of a
hypothesis rather than an investigation of variables underlying scores
on either predictor or criterion.

It is useful to distinguish between hypotheses that imply predictive
validity and those for which concurrent validity is appropriate. To
illustrate the difference, consider the possible finding that measures of
self-confidence are substantially and significantly correlated with
proficiency ratings of leadership. Three independently testable hypotheses
are possible: (a) that people who are self-confident become effective
leaders, (b) that people who are effective leaders become self-confident,
and (c) that people who are effective leaders are self-confident. The
first two of these are predictive hypotheses; they predict in opposite
directions. Ignoring the possibility of reciprocal causality, both of
these hypotheses require predictive studies to validate them, but the
design of the studies would be substantially different. In the first
hypothesis, one would administer the measure of self-confidence prior to
people gaining experience in leadership roles. For the second hypothesis,
one would not obtain the measure of self-confidence until people have been
in the leadership role long enough to establish clear and observable
habits of leadership. For the third hypothesis, the two measures could be
taken concurrently. The fact that very little benefit may accrue to anyone
from such concurrent correlation is beside the point; the point is that
the hypothesis is a different one and that in any correlational study
relating them, the procedures of investigation will be different.

There  has  been  an  over-reliance  on  criterion-related  validation  in 
the  history  of  personnel  testing.  The  simplicity  of  the  validity  statement 
makes  it  very  attractive,  and  it  is  often  necessary  for  specific  personnel 
purposes. However, things are rarely as simple as they seem, and many
factors  make  over-reliance  on  a single,  obtained  validity  coefficient 
questionable. 

First,  the  conditions  of  a validation  study  are  never  exactly  repeated. 
This  is  especially  evident  in  the  case  of  a predictive  study,  where  the 
logic  of  predictive  validation  assumes  that  the  conditions  at  the  start 
of  the  study  will  be  reasonably  well  matched  by  the  conditions  at  the 
start of a new time sequence when the results of the original study are
to be applied. If a validation study extends over three or four years
or  more,  new  methods  of  training,  new  equipment,  new  social  attitudes, 
new  applicant  characteristics,  and  many  other  new  things  may  change  the 
validity  before  the  results  can  be  put  to  use. 

Second,  the  logic  of  criterion-related  validity  assumes  a valid 
criterion. Very rarely, however, do criterion-related validity reports
give  any  evidence  of  the  validities  of  inferences  drawn  from  the  criterion 
measures  themselves.  All  too  often,  personnel  testing  uses  unvalidated 
supervisory  ratings  as  the  criterion.  In  many  of  these  cases,  a criterion- 
related  validation  study  is  probably  inadvisable. 

Third, the logic of criterion-related validity assumes that the
sample of an applicant population used for research is truly representative
and that the validity will generalize to later samples. This is almost
always violated to some degree, if only through bias in attrition.
Statistical procedures can, of course, provide better estimates of
population validities than those provided by the biased sample, but the
assumptions for these procedures often are not satisfied.

Finally, results of criterion-related validity studies, particularly
those in which the predictor is a composite of several variables, are
highly questionable if based on small numbers of cases. The sample size
necessary to conduct a competent investigation of criterion-related
validity is much larger than was earlier supposed (Schmidt, Hunter, &
Urry, 1970).
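
The third and final cautions above can be illustrated with a short
Python sketch. Neither calculation is taken from the report or from the
studies it cites. The first function assumes the familiar correction for
direct selection on the predictor, which itself rests on linearity and
equal error variance across the score range, the kind of assumptions the
text warns are often not satisfied. The second uses the Fisher z
approximation for the number of cases needed to detect a nonzero
population correlation with a two-sided test. Function and parameter
names are hypothetical.

```python
from math import atanh, ceil, sqrt
from statistics import NormalDist


def correct_for_range_restriction(r_restricted, sd_unrestricted, sd_restricted):
    """Estimate the applicant-population correlation from a restricted sample.

    Assumes selection was made directly on the predictor.
    """
    u = sd_unrestricted / sd_restricted
    return r_restricted * u / sqrt(1 + r_restricted ** 2 * (u ** 2 - 1))


def approximate_sample_size(rho, alpha=0.05, power=0.80):
    """Approximate cases needed to detect a population correlation rho.

    rho must be nonzero and between -1 and 1.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(((z_alpha + z_power) / atanh(rho)) ** 2 + 3)


# Hypothetical study: r = .30 among those hired, whose predictor standard
# deviation is half that of the applicant population.
print(correct_for_range_restriction(0.30, sd_unrestricted=10.0, sd_restricted=5.0))

# Cases needed for several hypothetical population validities.
for rho in (0.20, 0.30, 0.40):
    print(rho, approximate_sample_size(rho))
```

The second calculation ignores range restriction and criterion
unreliability, both of which would raise the required number of cases
further.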

CONSTRUCT  VALIDITY 

Despite the foregoing warnings, studies of criterion-related validity
are basic in investigations of construct validity. Where the criterion
is chosen because it can shed light on the intrinsic meaning of the scores
being validated, such studies enable one to sharpen possible interpretations
of test scores and to choose between competing interpretations. In this
context, low validities can be as helpful as high validities if they
indicate what the test does not measure and thereby limit the nature of
the variables legitimately inferred from the scores.

Construct validity is not a utilitarian notion. It is studied
because one wishes to increase one's understanding of the psychological
qualities being measured by the use of a particular test. Such studies
influence the degree of confidence one may have in the accuracy of
descriptive inferences about the individual tested. A test is ordinarily
supposed to be a measure of something; that something is an idea or
concept of a variable; if sufficiently sophisticated scientifically, it is
called a hypothetical construct. The latter term is intended to emphasize
an idea that has been constructed as a way of organizing knowledge and
experience; that is, a construct is a work of scientific imagination.

As  evidence  accumulates  about  a construct,  the  idea  may  change. 

The essential logic of construct validation is disconfirmatory.
One does research designed to disconfirm an intended interpretation by
persistently trying alternative interpretations; that is, one
investigates the possibility that a variable other than the one intended to
be measured (other than what a test "purports" to measure) is a better
interpretation of the scores. Variance in a test intended to be used
for inferring problem-solving abilities may in fact be substantially
contaminated by variance due to individual differences in reading
ability. Or a newly proposed construct may prove to be an old variable
conventionally measured by other means. In either case, the aim of the
research is to strengthen, if possible, a given interpretation of the
test by showing that alternative interpretations are not feasible. Of
course, if the alternative interpretation turns out to be a fairly solid
one, then perhaps the originally intended interpretation is the one that
is infeasible.




The notion of a hypothetical construct in its usual context is a
fairly sophisticated scientific construct itself. References in discussions
of hypothetical constructs deal with "nomological networks" of scientific
lawfulness (Cronbach & Meehl, 1955). The logic and disconfirmatory
emphasis of construct validation can, however, be very useful for ideas
that are much less well developed scientifically. Supervisory ratings
of work proficiency can, for example, be evaluated in terms of construct
validity. In this case, the construct is not a highly developed creation
of scientific imagination; it is a rather vague idea of proficiency on
a specific job. The question is not the scientific import or
sophistication of the idea, but whether proficiency is a reasonable
interpretation of the variable measured by the ratings. Disconfirmatory
research would consider alternative explanations. Perhaps the ratings
merely measure how long ratees have been known; therefore, studies would
be initiated to determine the relationship of length of acquaintance to
the ratings. This is, of course, a complex question. A mere correlation
between length of acquaintance and ratings may identify bias, or it may
show that experience does in fact count on that job. These are competing
inferences from a correlation, and the logic of construct validity
requires that one attempt to evaluate them and to choose between them. In
some circumstances, this might require another research study. In other
circumstances, it may merely require an exercise in logic; if the job can
be learned in a few days, or if proficiency is limited by external forces
such as supply of material to the worker or the speed of a conveyor belt,
the hypothesis that greater experience results in greater proficiency
is probably silly and the correlation would disconfirm the desired
interpretation of the ratings.

To say that valid inferences can be drawn about a specified construct
is to say little or nothing about the utility of the measure for practical
decisions. In a personnel selection situation, for example, the practical
utility of the measure depends less on how well it measures a given construct
than on how well the scores will predict future performance, regardless
of what or how many constructs they reflect.

As pointed out in the preceding section, in many circumstances
criterion-related validity is not feasible. In these situations one
exercises the logic rather than the arithmetic of predictive validity.
Elsewhere (Guion, 1976), the author has used the term "the rational
foundation for predictive validity" for situations in which construct
validity is evaluated as part of that logic. The phrase implies that the
logic of construct validation and the logic of predictive validation meet
if a predictive hypothesis is very carefully developed. The steps of
careful development include careful job analysis, rational inferences from
the information obtained in the job analysis about the kinds of constructs
that may be hypothesized as relevant to performance on the job as it
would, if it could, be measured, and finally the identification of
predictor variables that will validly measure those constructs. Such a
logical argument pools a great deal of empirical information: the
observations of the job, the group judgments involved in inferring the
constructs, and the evidence of the construct validities of the predictors.
None of this empirical information is necessarily expressed as validity
coefficients, yet to infer that high scores on the predictors predict
high performance on the job is arguably more valid under these
circumstances than when a validity coefficient is obtained from an
inadequate study.

CONTENT VALIDITY

Content validity is a special case of construct validity. It is
likely to be emphasized in measuring knowledge or performance variables,
and it is especially frequently invoked in evaluations of work samples.
For that reason, it will be considered here in particular detail.

The "construct" when one speaks of content validity is more obvious
than where reference is to more abstract constructs: level of knowledge,
level of skill, level of competence, or degree of mastery of a specified
content or skill domain. It has been customary to speak of content
validity when one wishes to infer from scores on a test the
probable performance in a larger domain of which the test is a sample.
We have already referred to domain sampling in sampling the kinds of items
that measure a construct; the concern here is for domain sampling where
the domain is more intuitively understood.

Content validity began in educational measurement as a straightforward
concept which posed no special problems. An educational curriculum
identifies an explicit body of knowledge and instructional objectives, and
educational practice has decreed that asking a question about specific
knowledge is an acceptable operation for measuring it. Therefore, if one
had all possible questions about a specified curriculum content, one
could obtain a universe or domain score by adding up the number of items
answered correctly. When one takes a sample of all possible items from
that domain, one can add up the number of items answered correctly and,
from that score, infer something about the number or proportion of items
that would have been answered correctly had the entire domain been used.
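
The inference from a sample of items to the whole domain can be sketched
numerically. What follows is only an illustration, not a procedure taken
from this report: it assumes that items are drawn at random from the
domain and scored right or wrong, and it uses the simple binomial
standard error with no finite-population correction. The function name
and the 1,000-item domain are hypothetical.

```python
import math
import random


def estimate_domain_proportion(sample_responses):
    """Estimate the proportion of a content domain answered correctly.

    sample_responses: list of 0/1 item scores on a random sample of items.
    Returns (proportion, standard_error) from the simple binomial formula.
    """
    n = len(sample_responses)
    if n == 0:
        raise ValueError("need at least one item")
    p = sum(sample_responses) / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se


# Hypothetical examinee who would answer 700 of 1,000 domain items correctly.
random.seed(0)
domain = [1] * 700 + [0] * 300
sample = random.sample(domain, 50)  # a 50-item test drawn from the domain
p_hat, se = estimate_domain_proportion(sample)
print(f"estimated domain proportion = {p_hat:.2f} (SE about {se:.2f})")
```

The standard error shrinks as the item sample grows, which is the sense
in which a longer test supports a firmer inference about the domain score.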

This account is perhaps unnecessarily glib, but the glibness gives
it brevity. It is acknowledged that the best practice in sampling
content domains defined by educational curricula would utilize what Cronbach
(1971) called the universe of admissible operations, which identifies
stimulus-response content in terms of the permissible kinds of questions
and the expected kinds of responses. Nevertheless, the glibness, if
that is what it is, seems defensible because the universe of admissible
operations in educational testing is reasonably restricted. A
combination of curriculum identification and conventional practice settles
many questions that might otherwise arise.


In personnel testing, however, the concept of content validity has
been much more troublesome. The definition of a content domain has been
a source of great confusion, and it is therefore necessarily difficult
to define a universe of admissible operations for measuring a domain one
does not clearly understand. Perhaps nowhere is the confusion better
documented than in the Standards (APA, et al., 1974). In its discussion
of the applicability of content validity to employment testing, that
document points out that "the performance domain would need definition
in terms of the objectives of measurement, restricted perhaps only to
critical, most frequent, or prerequisite work behaviors." Two paragraphs
further, on the same page, we read, "An employer cannot justify an
employment test on grounds of content validity if he cannot demonstrate
that the content universe includes all, or nearly all, important parts
of the job" (p. 29).

Job Content Universe. In attempting to clarify matters, it may be
useful to distinguish between the terms universe and domain and between
job content and test content. We may, therefore, identify four
conceptual entities: a job content universe, a job content domain, a test
content universe, and a test content domain.

A comprehensive job analysis may identify all the nontrivial tasks,
responsibilities, prerequisite knowledge and skill, and organizational
relationships inherent in a given job, and all of this defines a job
content universe.

Tasks are the things people do; job analysis need not identify
trivial tasks, but it should identify the most salient activities.
Responsibilities may include tasks but may also include less clearly
observable activities. A teacher, for example, may be responsible for
the health and safety of the children in her class. The precise
activities carried out in fulfillment of that responsibility may be hard to
define since they vary with changed circumstances. Prerequisite knowledge
and skill represent cognitive or motor abilities or information
necessary for effective and responsible task performance. Such knowledge or
skill needs to be defined unambiguously; vague trait names are not enough.
"Must be able to compute means, standard deviations, correlation
coefficients, and probability estimates" is a far more explicit statement than
saying, "Must have knowledge of statistics."

Organizational relationships place the job in its context; they
identify systems of ideas, materials, or social relationships as they
influence the job, dependencies that may exist in sequences of task
performance, and the degree to which people in other jobs must depend on
the incumbent in the job being analyzed in doing their job. Both the
organizational relationships and the responsibilities describe not only
the content of the job but the content of the consequences of the
performance of a job.

If the job is at all complex, it would be either impossible or
absurdly impractical to try to develop a work sample test to match that
total job content universe; it might be necessary to carry out the full
training, to provide experience, and to observe performance on the actual
job for a period of time. If one's purpose were selection, this would be
absurdly impractical.

Job Content Domain. In practice, one identifies a portion of the
job content universe for the purposes of testing. In a stenographic job,
for example, the portion of the universe most salient in selection or
performance evaluation might be restricted to those aspects involving
typing. From the job analysis, one could identify the tasks,
responsibilities, and the prerequisite skill (such as spelling) associated with
typing; with these restricted elements, and ignoring other aspects of the
total job content universe, one can define a job content domain. In
this sense, the word domain is being used as a sample (and not necessarily
a representative one) of the content implied by the word universe.



Test Content Universe. Performing a job and taking a test are not
identical activities, even if the component elements are identical. To
continue with the stenographic example, typing mailable letters from
dictation on a real job involves knowledge of the idiosyncrasies of the
person who has dictated the letters, interruptions by telephone calls
or requests for materials from files, etc. Typing from
the same dictated material in a test situation involves typing under the
anxiety created by the testing and its peculiar motivational
characteristics, in standard conditions such that any distractions are built into
the exercise and are standardized for all people, and using material
dictated by an unfamiliar voice. To the best of this writer's knowledge,
no one has ever developed a typing test that is a genuine work sample in
the sense of duplicating actual circumstances, distractions, and snide
comments on the dictation tape, nor has he encountered anyone who would
advocate it.


Instead, one defines from the job content domain a universe of
possible operations for the development of a test. The test content universe,
therefore, consists of all of the tasks that might be assigned, all of
the conditions that might be imposed, and all of the procedures for
observing and recording responses that might be used in the development
of the content sample. The test content universe is, again, a sample of
the job content domain. But it is more than that; it includes elements
that are not part of the job content domain since the latter probably
includes no information about procedures for observing and recording
behavior on assigned tasks. This would be particularly true if the
operations decided upon consisted of a series of questions about the
reasons for certain procedures in carrying out a task; one would
virtually never include such question-and-answer exercises as a part of
the actual job, but they can be quite useful in testing people to
determine their qualifications for the job.

This is a crucial point in the total chain of argument. In many kinds
of work sample testing, psychometric considerations require the inclusion
of non-job components in defining a test content domain; otherwise, there
may be no measurement of anything. Such added operations may involve
ratings by observers, counting (and perhaps weighting) responses to
questions, or carrying out physical measurements and inspection of products
that go beyond those encountered in the actual job itself but are
necessary foundations for measurement.

Test Content Domain. The test content domain is a sample of a test
content universe, and it defines the actual specifications for test
construction. Again, the test content domain is not necessarily a
representative sample of the test content universe. Questions of practicality
and of relative importance must assuredly enter into the judgments
defining a test content domain.

There remains, then, the actual construction of the test.
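
The relationships among the four entities can be laid out as sets. This
is a rough sketch only: every element label below is a hypothetical
stand-in for what a job analysis of a stenographic job might yield, and
treating the test content universe as the job content domain plus added
measurement operations is a simplifying assumption made for illustration.

```python
# Hypothetical element labels; none of them come from the report.
job_content_universe = {
    "take dictation", "type letters", "spell correctly",
    "file documents", "answer telephone", "schedule meetings",
}

# Job content domain: the portion of the universe chosen for testing.
job_content_domain = {"type letters", "spell correctly"}

# Operations added so that measurement is possible; not part of the job.
measurement_operations = {
    "standardized dictation tape", "timed conditions", "error-count scoring",
}

# Test content universe: what could be tested from the job content domain,
# plus the observing and scoring procedures the job itself never contains.
test_content_universe = job_content_domain | measurement_operations

# Test content domain: the specifications actually chosen for construction.
test_content_domain = {"type letters", "timed conditions", "error-count scoring"}

# Each domain is a (not necessarily representative) subset of its universe.
assert job_content_domain <= job_content_universe
assert test_content_domain <= test_content_universe

# The non-job components are exactly the added measurement operations.
non_job = test_content_universe - job_content_universe
print(sorted(non_job))
```

The last step makes the central point visible: some part of any test
content universe lies outside the job content universe altogether.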

The Limits of Content Sampling as Validity. The foregoing sequence,
which is illustrated by Figure 2, is not necessary as a detailed procedure,
but the four-step process of domain definition is useful for clarifying
the relationships of job and test domains and for reconciling the
contradictory statements in the Standards.

It should be clear that what has been called content validity is
quite different from all other forms of validity. As a matter of fact,
the term should not be used since it can only cause confusion. The term
validity refers, as has been pointed out, to an evaluation of the
inferences that can be made from scores. If the inference to be drawn from a
score on a content sample is to be an inference about performance on an
actual job, it is drawn at the end of a series of inferential leaps, in
any one of which there can be a serious misstep. The crucial chance for
misstep is in the definition of a test content universe; it is here that
a system of scoring (or its basis) is invented, and that system of scoring
is rarely if ever a component of the actual job content domain. Moreover,
the scoring system is subject to contamination, just as is the scoring
of any other test. That is, the obtained score an individual makes may
reflect the attribute one wishes to infer, ability to do the job, but it
may also reflect a variety of contaminations such as anxiety, ability to
comprehend the verbal instructions, or perceptual skills in seeing cues
for scoring enabling perceptive or test-wise people to make better scores
than others.


All of this has a familiar ring after the earlier discussion of
construct validity. All of the other possible components of a score
represent the contaminations which construct validation, in its commitment to
disconfirmatory research, is designed to investigate. To repeat: content
validity is a special case of construct validity (Messick, 1975; Tenopyr,
1977).

ACCEPTANCE OF OPERATIONAL DEFINITIONS

"Validity has long been one of the major deities in the
pantheon of the psychometrician. It is universally praised,
but the good works done in its name are remarkably few.
Test validation, in fact, is widely regarded as the least
satisfactory aspect of test development.... It is the
purpose of this paper to develop an alternative explanation of
the problem, and to propose an alternative solution. The
basic difficulty in validating many tests arises, we believe,
not from inadequate criteria but from logical and operational
limitations of the concept of validity itself. We are
persuaded that faster progress will be made toward better
educational and psychological tests if validity is given a much
more specific and restricted definition than is usually the
case, and if it is no longer regarded as the supremely
important quality of a mental test" (Ebel, 1961, p. 640).


With these words, Ebel began a critique of the concept of validity
as a major basis for evaluating tests. Many of the comments made in
that paper are still highly applicable; people still tend to think of
validity in terms of "real" traits, they still accept criterion measures
that have little if anything to do with the attributes being measured
(and do not recognize that in doing so they have formed an external
hypothesis), and the concept of validity is still far too broad to have
scientific utility. Alternatives include the evaluation of reliability,
normative data, the importance of the knowledge or abilities required by
a test, convenience in the use of the test, and, most of all,
meaningfulness.


Meaningfulness was also the primary yardstick for evaluation proposed
by Messick (1975); his concept of meaningfulness, however, turns out to be
equivalent to the concept of construct validity. Ebel, but not Messick,
would evaluate a test simply as an operational definition of an attribute
to be measured; the operations provide the meaning.

This writer takes the position that operational definitions of the
attributes to be measured can, under certain circumstances, provide both
a necessary and a sufficient evaluation of the scores obtained by using
them; that is, under certain circumstances, no statement of validity is
needed. It is operationalism, not validation, that provides the meaning
for fundamental measurement of physical properties of length, time, or
weight. As pointed out in the taxonomy of measurement, for measurement
of such variables as these, one asks not whether the measurements are
valid but whether they are accurate.

Some psychological measurement can also be defended as meaningful
because of the operations involved in the measurement without recourse to
psychology's unique demand for validating the inferences from the
scores. Operationalism does not always eliminate concern for validating
inferences; in fact, it is sufficient only in relatively restricted cases
(Ebel, 1956, 1961; Tenopyr, 1977). In Tenopyr's terms, there are some
constructs for which the content of the measurement, i.e., the operational
definition, is a sufficient evaluation. With reference to the taxonomy
of variables describing attributes of people, it would appear that these
constructs would include certain physical attributes, psychomotor skills,
task proficiencies, and, with a caveat, measures of job knowledge. (The
caveat is that scores on job knowledge tests may be unduly influenced by
reading abilities having little to do with the actual level of knowledge.)

For these constructs, at least in part, it would seem possible to
evaluate the job relevance and meaningfulness of a personnel testing
program on the basis of the operations alone. In a combination of two
other publications (Guion, 1977, in press), the writer has presented a
list of six requirements which, if met, constitute a sufficient evaluation
of the use of a test so that issues of validity need not arise. With some
modifications to fit the present context, and with emphasis on personnel
testing and judgment of job relatedness, these will be reproduced here.

First, the content domain must consist of behavior the meaning of
which is generally accepted. At the risk of sounding like Gertrude
Stein, we can say that doing something (like driving a car) is generally
accepted as evidence of the ability to do it. If a person reads a passage,
it means that he can read the passage; if he does not read the passage,
it may not mean an inability to read it (Messick, 1975), but it certainly
means that he did not. In such examples, the meaning of the behavior is
obvious; it requires no great inferential leap to interpret or to draw
inferences from the behavior samples.

Second,  both  the  test  content  domain  and  the  job  content  domain 
should  be  unambiguously  defined.  The  domains  should  be  defined  well  enough 
that  people  who  disagree  on  the  definition  can  nevertheless  agree  on 
whether  a particular  task  or  statement  or  item  belongs  in  or  out  of  the 
domain. In the present age of litigation, agreements on the definition
of a content domain are always tenuous. The amount of agreement needed
does not depend on nailing down, in very precise language, every
conceivable component of a domain. It is enough that the boundaries of the
domain are sufficiently well established for agreement among reasonable
and knowledgeable people.


Third, the test content domain must be relevant to the job content
domain. The question of relevance is again a matter of judgment, and
judgment requires some evidence of agreement. In originally presenting
this third condition, the lack of a measure of the degree of agreement
of the domains was deplored; it now seems that the extent of agreement
among qualified judges that the two are comparable is sufficient.

Fourth, qualified judges must also agree that the test content domain
has been adequately sampled. The need to define what is meant by qualified
judges is particularly strong in this condition. From the point of view
of personnel testing, the best qualified judges are usually people who
have done the job in question or who have supervised the performance of
that job. The required level of agreement would appear to be minimally
that necessary to avoid conflict. Disagreements differ qualitatively.
Some qualified judges will disagree on semantic grounds; others may
disagree because of fundamental differences in value systems. The
disagreement between plaintiff and defendant is a serious level of disagreement;
the disagreement between one who would suggest a slight change in wording
and one who prefers the existing wording is not a profound disagreement
and need not be taken seriously in evaluating domain sampling. The
question, therefore, is whether there is a consensus (a majority view) and
whether there is a reasonable freedom from dissatisfaction with the
consensus on the part of most qualified judges. This requirement holds
for defining the boundaries of a content domain, for judging the
relevance of a test content domain to a job content domain, and for judging
the adequacy of the sampling of the test content domain.
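
The consensus requirement is a matter of judgment, but a crude tally can
show what is being asked. The sketch below is an assumption-laden
illustration, not a rule proposed in this report: the function name is
hypothetical, each judge is reduced to a single accept/reject vote, and
the 0.25 ceiling on dissent is an arbitrary stand-in for "reasonable
freedom from dissatisfaction."

```python
def consensus(votes, max_dissent=0.25):
    """Summarize judges' votes on the adequacy of domain sampling.

    votes: list of booleans, True if a qualified judge accepts the sampling.
    max_dissent: arbitrary illustrative ceiling on the share of dissenters.
    Returns (share_accepting, has_consensus).
    """
    if not votes:
        raise ValueError("no judges supplied")
    share = sum(votes) / len(votes)
    majority = share > 0.5                  # a majority view exists
    low_dissent = (1 - share) <= max_dissent  # dissatisfaction is limited
    return share, majority and low_dissent


# Eight of ten hypothetical judges accept the sampling.
share, ok = consensus([True] * 8 + [False] * 2)
print(share, ok)
```

A tally of this kind cannot distinguish semantic quibbles from
fundamental disagreement; that qualitative sorting still has to be done
before the votes are counted.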

Fifth, the response portion of the testing must be reliably observed
and evaluated. In the original presentation of this point, it was said,
"This does not refer to internal consistency, of course" (Guion, 1977, p.
7). The phrase "of course" is now regretted. At the very least, any
measurement should have some degree of functional unity; if there is not
even enough internal consistency for significant correlations to exist
between the component parts of a content sample, then the score of the
content sample should be subdivided into reasonably internally consistent
components. This comment, it should be pointed out, is a necessary
consequence of saying that what has passed for content validity is in fact a
special case of construct validity; the first requirement of construct
validity is internal consistency.
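
Internal consistency among the component parts of a content sample can
be checked in several ways. The report does not name an index; the
sketch below uses coefficient alpha as one familiar choice, and the
function name and the score matrix are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Coefficient alpha for a table of component scores.

    scores: list of rows, one per examinee; each row lists that examinee's
    scores on the k component parts of the content sample.
    """
    k = len(scores[0])
    if k < 2:
        raise ValueError("need at least two components")
    # Variance of each component across examinees.
    component_vars = [pvariance([row[j] for row in scores]) for j in range(k)]
    # Variance of the total score across examinees.
    total_var = pvariance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores do not vary")
    return (k / (k - 1)) * (1 - sum(component_vars) / total_var)


# Five hypothetical examinees scored on three components of a work sample.
ratings = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [1, 2, 1], [3, 3, 3]]
print(round(cronbach_alpha(ratings), 2))
```

A low value would suggest reporting the components separately rather
than as one total score.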

A more  important  implication  of  this  requirement  is  that  observers 
who  record  observations  must  agree  reasonably  well  on  what  they  have  seen. 
If  the  behavior  to  be  observed  is  not  defined  well  enough  to  permit  inter- 
observer agreement,  it  violates  the  first  condition  of  an  operational 
definition  based  on  content  sampling. 
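
Agreement between two observers can likewise be put in numbers. The
sketch below computes simple chance-corrected agreement (Cohen's kappa)
for two observers who classify the same performances; the choice of
index, the function name, and the pass/fail records are assumptions made
for illustration, not part of the original six requirements.

```python
from collections import Counter


def cohen_kappa(observer_a, observer_b):
    """Chance-corrected agreement between two observers' classifications."""
    n = len(observer_a)
    if n == 0 or n != len(observer_b):
        raise ValueError("observers must rate the same non-empty set")
    # Proportion of performances on which the observers agree.
    observed = sum(a == b for a, b in zip(observer_a, observer_b)) / n
    # Agreement expected by chance from each observer's category rates.
    counts_a, counts_b = Counter(observer_a), Counter(observer_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Two hypothetical observers recording the same six work-sample performances.
first = ["pass", "pass", "fail", "pass", "fail", "pass"]
second = ["pass", "pass", "fail", "fail", "fail", "pass"]
print(round(cohen_kappa(first, second), 2))
```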

Sixth, the method of scoring the content sample must be generally
accepted by qualified judges as relatively free from contaminants
reflecting irrelevant attributes of examinees or attributes of observers or
materials. This implies no stringent demands for agreement among the
judges. If there is a serious suggestion of contamination from judges
who have made the previous judgments, some study inquiring into the
construct validity of the scores may be necessary.

Intrinsic Validity. A different approach to operationalism can be
drawn from a parallel to the concept of intrinsic validity (Gulliksen,
1950), another way in which the meaningfulness of an operational
definition can be known by its outcomes. For example, if an examinee is
coached to take the test, and coaching for the test improves both test
performance and performance on the job, then scores on the test are
intrinsically related to performance on the job. The investigation of
this relationship is, of course, an empirical investigation; it does
not rest upon the consensus of qualified judges. Nevertheless, it is
only remotely related to its closest cousin among the validities,
criterion-related validity. For the test to be accepted as an operational
definition, under this heading, not only must a correlation between
test performance and job performance be obtained, but it must not be
lost as a consequence of coaching.


Operationalism Based on Formal Structure. If work sample
performance is to be evaluated by evaluating the product of that performance,
and the product is a tangible object, then the measurement may consist of
measuring weight, conductivity of solder connections, the amount of
stress needed to break a weld, or similar physical measurement. Such
measurements are formal, fundamental measurements and they need no
justification by recourse to notions of validity.

The logic of formal measurement could be extended to some other areas
of psychological measurement. Two possibilities seem worth mentioning
which, if tests could be successfully constructed by these methods, would
provide formal measurement that should be accepted without any concern for
notions of validity. One of these uses Guttman scaling for content-
referenced interpretations of scores; the other applies latent trait
theory. These will be discussed in detail in the next section.

CHALLENGES  TO  CLASSICAL  THEORY 

Classical psychometric theory has its origin in the study of
individual differences. This study requires maximum distinctions between
individuals, that is, maximum variances within groups. All of classical
theory is based upon variance and upon the subdivision of variance into
systematic and error sources. A test is said to be reliable, for example,
to the extent that the variance in a set of scores obtained through its
use is free from random error variance. In its broadest sense,
validity is likewise defined as the extent to which the variance in a set of
scores is relevant to the purposes of measurement. In test construction,
the best items are those in which there is a good match between item
variance and total test score variance. The unit of measurement in
mental testing is the standard deviation, and the basis for interpreting
test scores is the relationship of one individual to another in a
distribution.

In short, the emphasis has been on relative measurement rather than
on anything fundamental or absolute. The contributions of classical
psychometric theory have been substantial, but they have led to some
peculiar phenomena. For example, grades in a course of study such as physical
education may be based not on the number of pushups one can do or the
distance one can swim, but rather upon how many pushups or how many laps
one can do in comparison to others in the class. Special characteristics
of the class do not enter into the standard evaluation of performance
using classical theory.
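
The physical education example can be made concrete with a little
arithmetic. In this sketch the pushup counts and class rosters are
invented for illustration, and the function name is hypothetical; the
point is only that the same absolute performance receives opposite
relative standings in two different classes.

```python
from statistics import mean, pstdev


def z_score(x, group):
    """Standing of x relative to a group, in standard deviation units."""
    return (x - mean(group)) / pstdev(group)


# Invented pushup counts for two classes of five students each.
strong_class = [40, 45, 50, 55, 60]
weak_class = [10, 15, 20, 25, 30]

pushups = 40  # one student's absolute performance
z_strong = z_score(pushups, strong_class)
z_weak = z_score(pushups, weak_class)

# The same 40 pushups fall below the mean in one class and well above it
# in the other; the absolute count itself never enters the interpretation.
print(round(z_strong, 2), round(z_weak, 2))
```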

The  illustration  points  out  three  objections  that  have  been  leveled 
against  the  use  of  classical  psychometric  theory  for  many  forms  of  measure- 
ment in  psychology  in  general  and  in  personnel  testing  in  particular: 


1.  The  evaluation  of  measurement  and  of  the  interpretation  of  indi- 
vidual scores  depends  on  the  unique  characteristics  of  the  sample 
of  people  and  the  sample  of  items  studied  in  the  construction  and 
standardization  of  the  test. 

2. Classical interpretations of scores provide no standard for the
interpretation of an individual score beyond its relative
position within the distribution of scores in the sample of people
studied. If the distribution as a whole is quite high, a low
score within that distribution is treated as a poor score,
even if in some absolute term it would be considered high.
Even the techniques for estimating true scores are based upon
sample distribution; estimation of a so-called true score is
simply a device for acknowledging the fallibility or
unreliability of measurement. It does not take into account the
relationship of that estimated true score to any standard of
measurement.

3. Classical measurement theory offers no definition of the limits
of the usefulness of the test or of the degree to which the
classical statements of validity, reliability, or norms may be
generalized. No sample is ever precisely like the sample upon
which norm tables have been built, but those tables are
consistently used for interpreting the scores of people not in
that sample. To what extent do these interpretations apply to
people who are different from the original sample in certain
ways? To what extent can the standardized interpretations of
scores as norms be applied to different sets of conditions?

Such questions have no answer in classical psychometric theory.
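
The second objection, that estimated true scores rest on the sample
distribution, can be illustrated with the familiar regression estimate
of a true score, commonly written as M + r(X - M), where M is the group
mean and r the reliability. The report does not give this formula; it is
supplied here as an assumption, and the numbers are invented.

```python
def estimated_true_score(observed, group_mean, reliability):
    """Regression estimate of a true score: M + r * (X - M)."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must lie between 0 and 1")
    return group_mean + reliability * (observed - group_mean)


# The same observed score, interpreted against two invented reference groups.
observed = 40
in_high_group = estimated_true_score(observed, group_mean=50, reliability=0.8)
in_low_group = estimated_true_score(observed, group_mean=20, reliability=0.8)

# The estimate is pulled toward whichever group mean is used, so it still
# says nothing about the score's relation to an absolute standard.
print(in_high_group, in_low_group)
```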

Three challenges to classical psychometric theory can be identified
and discussed as potential solutions to this set of problems: content-
referenced measurement, latent trait theory, and generalizability theory.
In addition, another "challenge" is based on the fact that psychometric
theory evaluates only inferences from scores, not the effects of the uses
of such inferences. Program evaluation will be briefly mentioned in this
context.

CONTENT-REFERENCED MEASUREMENT

The term content-referenced measurement will be used here to apply
to any measurement technique developed explicitly to interpret scores
relative to some sort of standard. The nature of the standard may vary;
it might be a relatively precise point, perhaps with very tight tolerances,
as in measuring machined work products. It might be a much more diffuse
range of measurements, as in defining a range of satisfactory "scores"
in physiological measurements associated with health. It might be an
arbitrary cutting point, above which some people are selected and others
rejected. However it is defined (and the defining of a standard
identifies one of the problems with the relevant literature), that definition
results in interpretations of scores relative to the internal structure
or content of the measuring instrument rather than to a distribution of
obtained measures. Whatever it is, and there is much debate over its
precise nature, the one point to be emphasized is that content-referenced
measurement is not norm-referenced measurement!
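
The three kinds of standard just mentioned differ only in the form of
the comparison. The sketch below is illustrative: the function names and
all numerical values are hypothetical, and none of the comparisons
refers to a distribution of other people's scores.

```python
def within_tolerance(measure, target, tolerance):
    """Precise point with a tolerance, as for a machined work product."""
    return abs(measure - target) <= tolerance


def in_satisfactory_range(measure, low, high):
    """Diffuse range of acceptable values, as in physiological measures."""
    return low <= measure <= high


def meets_cutting_point(score, cut):
    """Arbitrary cutting point at or above which a person is selected."""
    return score >= cut


# Hypothetical values chosen only to exercise each kind of standard.
print(within_tolerance(10.02, target=10.0, tolerance=0.05))
print(in_satisfactory_range(72, low=60, high=100))
print(meets_cutting_point(68, cut=70))
```

Each function interprets a score against the standard alone, which is
the sense in which such measurement is not norm-referenced.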

In keeping with the APA Standards, the term chosen here is content-
referenced measurement in preference to the more common term, criterion-
referenced measurement. In most problems in educational measurement, the
distinction between the two terms may be trivial enough to explain why,
despite the preference in the Standards, the former has not been adopted;
moreover, the term, criterion-referenced, has been so widely accepted in
educational circles that there is a very real problem in attempting to
change it (Hambleton, Swaminathan, Algina, & Coulson, 1978). For personnel
testing, however, the distinction is exceedingly important. The term
criterion has been widely used to identify a variable external to the
test itself. It is quite possible, particularly in the development of
work sample tests, to construct the test so that scores on it can be
directly interpreted in relation to a standard of job performance
(criterion) measured externally. This may be more than simply using expectancy
tables to interpret test scores, although that could be one example. It
could also imply that a work sample constructed to abstract various
components of the job can yield scores explicitly tied to such job
performance measures as scrap rates or others. Such interpretation of scores
in relation to external criteria has never been envisioned in the
educational measurement literature on so-called criterion-referenced testing,
but it is important enough in personnel testing to warrant special efforts
to avoid confusion.


Moreover, the emphasis on content-referenced interpretation according
to the Standards refers to those interpretations "where the score is
directly interpreted in terms of performance at each point on the
achievement continuum being measured" (APA et al., 1974, p. 19, emphasis added).
This is clearly a different idea from much of the literature on criterion-
referenced testing, which effectively treats any score in the distribution
simply as above or below a specified score or standard.

In summary, content-referenced testing seems a preferable term
because (a) it is more descriptive, (b) it avoids ambiguity, (c) it fits
the terminology of the Standards, and (d) it avoids any implication of
dichotomy. The term is not the only one that might have been chosen.
The relevant literature includes, in addition to content-referenced and
criterion-referenced measurement, standards-referenced measurement,
universe-referenced measurement, domain-referenced measurement,
objective-referenced measurement, and mastery testing. Each of these
terms has been proposed, and has its adherents, because of a special
emphasis that is sought. This is a final advantage of the term chosen for
this report, because it seems indebted to no prior bias.

The foregoing is more than a semantic exercise. The choice of
language can influence substantially the directions taken in applying the
diverse literature, some of which has been spawned less from an interest
in making a new contribution to measurement theory than in challenging
the old and established. The concept, under whatever name is chosen,
has attracted very little attention among personnel testing specialists.
Tenopyr (1977) said that "the notion of criterion-referenced test
interpretation ... has no application in an employment setting" (p. 51).
Ebel (1977) seems to agree. The point of their rejection of the idea may
be as much a rejection of the rhetoric leading to dichotomous scoring as
of the idea of interpreting scores relative to a standard.

Certainly there are places in personnel testing where one should
interpret measurement against some standard other than the mean of a
distribution, even if it means a dichotomous interpretation. Certainly,
where productivity is determined by the speed of a moving conveyor, the
individual who cannot keep up with the conveyor belt is performing at an
inadequate level, whether that person is at the bottom of a distribution
or merely a standard deviation below the mean.

Work Samples as Content-Referenced Tests. Work samples constitute
a special form of content-referenced testing; the principal evaluation
of them is in terms of job relevance. The previous discussion of content
domain sampling suggested that judgments of job-relatedness can be
simplified by thinking of a four-stage process of defining the most
complete possible conception of the job (the job content universe),
selecting a domain of interest from that universe, and then defining the
related test content universe and domain.

A work sample test is developed by sampling from that final domain.
In some cases, one might use work sample techniques to develop a test
which is not strictly a sample of work performance but from which work
performance might be inferred. It has become an accepted cliche for such
tests to refer to "the inferential leap." Figure 3 is an attempt to show
graphically (and perhaps whimsically) some limits to the appropriateness
of the term.

Tests can be developed which literally sample job content, adding
only enough testing operations to provide a scoring system. Probationary
assignments can be carefully chosen, and performance on them can be
carefully evaluated. These are the most complete samples that can be
developed for selection or certification purposes. Simulations represent,
in varying degrees, abstractions of "real" job content; they are less
precisely samples, shorter, and more standardized. Tests called "work
samples" are usually also abstractions from job content, typically more
abstract than simulations.

The meaning of abstraction, in this context, can be illustrated by
referring again to the stenographer's job. In work sample testing, one
does not try to create precisely every exact task and every exact
environmental condition influencing task performance. Rather, one
classifies various kinds of tasks (classification is itself a process of
abstraction), and creates examples of the different classes; these,
performed under standard conditions and scored according to rules which
are not part of the job, become the work sample test. In all three cases,
the performance evaluated is a direct sample of performance on the actual
job. A small problem of inference may be introduced by the scoring or
evaluation procedures, which can be contaminated by factors unrelated to
real job performance, but the inference can hardly be said to require a
leap.

Figure 3. Samples and inferences in work sample testing

A substantial portion of the job content domain, and therefore of an
appropriate test content domain, consists of knowledge required to
perform the job. In a work sample test consisting of tasks to be
performed, the examinee gives evidence of the prerequisite knowledge by
performing satisfactorily. In many certification programs, however, the
work sample degenerates into a test of job knowledge alone. The verb has
been chosen judiciously, for the job does not consist of knowledge
isolated from action. (Some jobs consist primarily of knowledge. Where
mastery of the knowledge component is likely to be a harder or more
critical feature of the job than any actions using it, a job knowledge
test is one kind of direct work sample.) The use of the job knowledge
test usually implies the inference that having the knowledge leads to
effective performance. Figure 3 suggests that this may not be a very
great leap (more an inferential step) but that it is indeed more an
inference than a sample. When one departs still further from actual
performance of the job content, such as inferring prerequisite cognitive
skills or essential attitudes, the measurement of these attributes really
does require an inferential leap from test content to job content.

The greater the degree of abstraction from actual job assignments,
the more appropriate is the metaphor of the leap, and also the more
appropriate is a criterion-related validation strategy. Work sample
testing, if it is to be accepted on its own terms as content-referenced
testing, should be concerned more with sampling than with inferring.

Job Analysis. Many kinds of job analysis procedures can be used for
content-domain sampling. The procedures suggested here are illustrative,
not prescriptive.

Briefly, the job analysis procedure may result in a series of formula
statements of the form, "(Takes action) in (setting) when (action cue)
occurs, using (tools, knowledge, or skill)." For a truck mechanic, such a
statement might read, "Flushes truck radiator in garage when engine is
said to overheat using water under pressure in flush tank." From such
statements, one can specify what a worker does, what knowledge is
necessary to do it, where information or material used in doing it comes
from, and what happens after the task is finished. Such information
defines the tasks, the methods, the prerequisites, and the contingencies
that comprise the job content universe.
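
The formula statement can be treated as a simple structured record. The
following is a minimal sketch in Python, not a procedure given in this
report; the class name and the field names (action, setting, cue, means)
are hypothetical and chosen only for illustration. It shows how such
statements might be stored and rendered for review by job experts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskStatement:
    """One job-analysis statement; all field names are illustrative assumptions."""
    action: str   # what the worker does
    setting: str  # where the action takes place
    cue: str      # the event that prompts the action
    means: str    # tools, knowledge, or skill used

    def render(self) -> str:
        # Follows the template: (takes action) in (setting) when
        # (action cue) occurs, using (tools, knowledge, or skill).
        return (f"{self.action} in {self.setting} when {self.cue}, "
                f"using {self.means}.")


# The truck mechanic example from the text, broken into its parts.
radiator = TaskStatement(
    action="Flushes truck radiator",
    setting="garage",
    cue="engine is said to overheat",
    means="water under pressure in flush tank",
)
print(radiator.render())
```

Keeping the four parts separate would let a panel sort or filter the
statements by setting, cue, or required knowledge when narrowing the job
content universe.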

With the job content universe defined, panels of expert judges
(people who know the job well) can whittle it down to a test content
domain and can establish test specifications.

Assembling Test Content. In paper-and-pencil testing, one refers at
this stage to writing items. The "items" in a work sample test might be
tasks. Alternatively, tasks might be "subtests" and the "items" might be
component characteristics of the process or product evaluated. In any
event, scorable elements of the test are defined, developed, and
assembled by experts on the job.

The essential meaning of the scores depends on the qualifications of
the experts, the care with which they have reached the various judgments,
and their overall degree of agreement. If all has been well done, scores
(whether overall or on component tasks) can be interpreted directly with
reference to the content of the test and without reference to any
distribution of scores.


Scaling Test Content. Interpretation of scores with reference to
test content can be facilitated and defended by establishing a formal
metric for scoring. If a series of components of tasks, or components of
a task content domain, can be arranged according to a genuine Guttman
scale, all scores can be interpreted with reference to points on that
scale. This idea grows out of the illustration of "content standard
scores" offered by Ebel (1962). In an arithmetic test of many items, ten
items were selected. To this writer, merely glancing at the items
identified, they seemed to fall along a scale of difficulty. If indeed
they did fall along a scale, without overlapping discriminal dispersions,
then any measurement technique using the other items could be tied
statistically to the values along that scale. The result would be a
content-referenced score with formal demonstration of transitivity.
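
Whether a set of items approaches a genuine Guttman scale can be checked
numerically. The sketch below is an illustration only: the report does
not specify a computation, and the function name, the pass/fail (1/0)
data layout, and the error-counting convention (each person's responses
compared with the ideal pattern implied by that person's total score) are
assumptions. It returns a coefficient of reproducibility, where 1.0 means
that every response pattern is perfectly ordered.

```python
from typing import List


def reproducibility(responses: List[List[int]]) -> float:
    """Coefficient of reproducibility for pass/fail data.

    responses[p][j] is 1 if person p passed item j, else 0.
    Errors are counted against the ideal pattern for each person's
    total score (an assumed convention; others exist).
    """
    n_people = len(responses)
    n_items = len(responses[0])
    # Number of people passing each item; more passes means an easier item.
    passes = [sum(row[j] for row in responses) for j in range(n_items)]
    # Item indices ordered from easiest to hardest.
    order = sorted(range(n_items), key=lambda j: -passes[j])
    errors = 0
    for row in responses:
        score = sum(row)
        for rank, j in enumerate(order):
            # In a perfect scale a person with total score s passes
            # exactly the s easiest items and fails the rest.
            ideal = 1 if rank < score else 0
            errors += int(row[j] != ideal)
    return 1.0 - errors / (n_people * n_items)


perfect = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
flawed = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 1]]
print(reproducibility(perfect), reproducibility(flawed))
```

In the second data set the last person passes the hardest item while
failing the easiest, which counts as two departures from the ideal
pattern and lowers the coefficient.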

An example of measurement approaching this sort of scaling is the
Learning Assessment Program described by Grant and Bray (1970). In this
program, examinees were given a series of tasks to learn to perform,
seven in all; these were ordered so that it was necessary to have learned
how to do task 1, to do task 2, and so on. The score for evaluating
performance in this program was the level of the tasks learned. Thus, one
who learned five tasks in a reasonable time was considered more proficient
at the overall set of tasks than one who could only master three.

The same logic, it should be noted, can be applied to cognitive skill
items. If it can be shown that a subset of items does form a reproducible
scale, and if it can be further argued that these items constitute marker
variables for a particular construct, then the formal properties of the
scale should provide a sufficient operational definition for the evaluation
of a testing program using it.

Evaluating Content-Referenced Tests. Do classical concepts of
reliability and validity apply to content-referenced tests? Is it
sensible to develop a test to measure, let us say, proficiency at the end
of training (all trainees having at that time mastered the material of the
test and therefore exhibiting no individual differences in proficiency),
and to evaluate that test in classical terms defined on the basis of test
score variance? Does it make sense to use norm-referenced concepts to
evaluate content-referenced tests?




Much controversial literature has been devoted to such questions.
The controversy probably stems from the non sequitur imbedded in the
second question. It is indeed a non sequitur to equate measurement
objectives with instructional objectives. A desire to have all trainees
perform at an equally high level at the end of training is an objective
demonstrably different from a desire to measure performance at that
level. An analogy would be a Procrustean desire to stretch all little
boys during their period of growth so that they can all be basketball
players exactly seven feet tall. Success in the venture would lead to
measures of height that have no variance; it does not follow that the
yardstick used should be incapable of identifying other heights! Neither
does it follow from recognizing this absurdity that variance-based
statistics for determining reliability and validity are the appropriate
evaluations.

In psychological measurement generally, validity has been an
over-rated approach to evaluation; in work sample testing, validity
concepts are far less important evaluations than are evaluations of job
relevance. Content-referenced work samples developed according to the
principles outlined above are assuredly job-related solely because of the
method of their construction. Such a test, if scored with reference to a
formal Guttman scale, could be evaluated particularly highly because of
the meaningfulness of the metric. It is unfortunate that preoccupation
with the concept of validity in classical measurement theory should make
test users so willing to ignore the quality of measurement per se in
their evaluations of the use of a test.

To assert that validity is an over-rated concept does not deny its
real importance. In any sort of measurement where inferences are to be
drawn beyond the descriptive character of the measuring instrument, the
form of validity generally called construct validity is essential;
nothing in content-referenced measurement relieves it of the obligation
to be concerned over construct validity. Content domain sampling offers
the first, and perhaps the only necessary, validity of inferences of
ability to do the job as sampled. If, however, the intended
interpretation of the score seems to include something more than the test
content (a frequent case), such as mastery or competence, then the score
implies expectations the soundness of which must be demonstrated by the
usual lines of evidence of construct validity (Linn, 1977).

That evidence may require experimental data showing that variance
within groups judged as competent (or masters) is low relative to the
variance between those groups and others judged as less competent (or
nonmasters). Traditional validity coefficients may be useful, where
obtained variances permit them, as results of inquiries into different
aspects of the construct validity of the scores. Also, scores (or
observations) on content-referenced tests must be reliably determined,
although the nature of reliability may be conventional estimates of
systematic variance, studies of the generalizability of scores, or the
consistencies of classifications.
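
The comparison of variance within judged groups to variance between them
can be illustrated with an ordinary one-way analysis of variance. The
sketch below is only one way of expressing that comparison and is not a
procedure prescribed in this report; the function name and the scores are
hypothetical.

```python
from statistics import mean
from typing import List


def variance_ratio(groups: List[List[float]]) -> float:
    """Ratio of between-group to within-group mean squares (one-way F)."""
    all_scores = [x for g in groups for x in g]
    grand = mean(all_scores)
    k = len(groups)       # number of judged groups
    n = len(all_scores)   # total number of examinees
    # Spread of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Spread of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


masters = [9.0, 10.0, 11.0]      # hypothetical scores, judged competent
nonmasters = [4.0, 5.0, 6.0]     # hypothetical scores, judged less competent
print(variance_ratio([masters, nonmasters]))
```

A large ratio indicates that scores differ far more between the judged
groups than within them, which is the pattern of evidence described
above.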

LATENT TRAIT THEORY

Under various names (latent structure analysis, item characteristic
curve theory, Rasch model), latent trait theories constitute another
approach to the construction of formal measuring instruments. The
distinguishing importance of the method is that it defines item
difficulties and other characteristics more or less independently of
characteristics of the particular samples from which the data
distributions are drawn.

Originally developed for the assessment of attitudes (Lazarsfeld,
1950), latent trait theory has subsequently been used mainly in the
measurement of cognitive abilities (Lord & Novick, 1968; Hambleton & Cook,
1977). It can be used for at least some forms of work sample testing.
Applications to tests of knowledge have been shown by Bejar, Weiss, &
Gialluca (1977), and an application to personality measurement by Bejar
(1977) seems directly applicable to measures of the quality of work sample
products and other practical problems of personnel testing.




The Theoretical Foundations. Although the mathematical foundations
of latent trait theory are beyond both the scope of this report and the
abilities of the writer, a brief account of the nature of the theory is
useful for discussions of its applicability.

An item characteristic curve can be identified in which the
probability of a correct response to the item is seen as a function of the
examinee's ability level. Various models exist for defining the function,
one of which describes the item characteristic curve as a normal ogive.
Figure 4 shows hypothetical item characteristic curves for three items.
Item 1 has a fairly typical difficulty level; many people get it wrong,
but many get it right. Item 2 is a very difficult item; only people of
very high ability are likely to get it right, although people of low
ability seem to get it right by guessing more often than on the other
items. Item 3 is a highly discriminating item; most people with above
average ability will get it right, and those with ability below average
are unlikely to give a correct response.

Three parameters can be estimated for defining each of these curves.
Parameter a is a discrimination index, proportional to the slope of the
curve at the inflection point. Parameter b is a difficulty index,
defined as the ability level on the base line corresponding to the point
of inflection (the point corresponding to a .50 probability of correct
response if the third parameter is zero). Parameter c is the probability
of a correct response at infinitely low ability levels, often called the
guessing parameter. Parameters estimated in a given analysis include
the ability levels, identified as theta (θ) in Figure 4, of the people
tested as well as the item parameters.
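
The three parameters can be made concrete with a short computation. The
sketch below assumes the usual three-parameter normal ogive form, in
which the probability of a correct response is c + (1 - c) times the
standard normal cumulative probability of a(θ - b); the report describes
the parameters in words only, so this formula, the function names, and
the example values are assumptions for illustration.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def icc(theta: float, a: float, b: float, c: float = 0.0) -> float:
    """Probability of a correct response at ability level theta.

    a: discrimination, b: difficulty, c: guessing (lower asymptote).
    Assumed form: c + (1 - c) * Phi(a * (theta - b)).
    """
    return c + (1.0 - c) * normal_cdf(a * (theta - b))


# With no guessing, the probability at theta == b is .50.
print(icc(0.0, a=1.0, b=0.0))
# With c = .20, the probability at theta == b is midway between c and 1.
print(icc(1.0, a=2.0, b=1.0, c=0.2))
# At very low ability the curve approaches the guessing parameter.
print(icc(-10.0, a=1.0, b=0.0, c=0.25))
```

Under this form, a larger a gives a steeper curve at the inflection
point, a larger b shifts the curve toward higher ability, and a nonzero c
raises the lower tail.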

The theta scale may be defined arbitrarily in any given analysis;
the numerical values of the difficulty parameters are therefore
arbitrarily expressed for a given sample. However, parameters estimated
from samples with different characteristics correlate very highly, even
if one sample consists of the low-scoring half of a distribution and the
other sample consists of the high-scoring half (Rudner, 1976). Available
equating procedures permit merging the latent ability scales for the
population as a whole and expressing item characteristics in terms of that
common scale. The resulting item characteristic curves are essentially
congruent regardless of the sample from which they were developed.
Failure to obtain such congruence indicates either a poor fit of the
model or the possibility of item bias (Ironson, 1977).

Figure 4. Item characteristic curves of three hypothetical items
(horizontal axis: latent ability, θ)

The description presented here with Figure 4 (three parameters
defining a normal ogive) refers to one of many models for latent trait
analysis. There are logistic curves as well as normal ogives, and there
are models that estimate only one or two of the parameters. The
"two-parameter" models estimate discrimination values and difficulties;
the "one-parameter" models estimate only difficulty levels.
Multidimensional as well as unidimensional models have been proposed, and
models are available for dichotomous, polychotomous, graded, or continuous
responses (Samejima, 1969, 1972, 1973).

In classical psychometric theory, the standard error of measurement
is generally treated as equal across the range of a distribution of
scores. Its counterpart in latent trait analysis, the standard error of
the estimate of ability, varies with the ability level. It is possible
to construct item information curves showing the precision of the
estimation of ability from responses on a single item at different ability
levels. Tests measuring the same latent ability on the common scale can
be assembled with different combinations of items, each with different
item characteristic curves and item information curves. Combining item
information curves across items yields a test information curve, the
highest point of which is the level of ability at which the information
(that is, the precision of estimating ability) provided by that set of
items is greatest. Item characteristic curves may likewise be combined to
yield a test characteristic curve in which the probability of an obtained
score is a function of the underlying ability level.


Uses of Latent Trait Analysis. If, for a given item, the item
characteristic curves for two distinguishable groups of people are not
essentially congruent, then that item cannot be said to be measuring the
same latent ability in those two groups. Therefore, latent trait theory
can be used to identify sources of item bias across race or sex groups.

This  has  implications  for  judgments  about  the  adverse  impact  of 
tests  used  as  decision  tools.  If  there  are  substantial  differences  in 
obtained  score  distributions,  the  proportions  of  the  groups  selected  (or 
classified  into  a desirable  category)  will  differ.  Current  governmental 
regulations governing the use of employment procedures call for
investigations to determine which of alternative selection tools will have lesser
adverse  effect,  that  is,  which  tests  will  have  smaller  mean  differences 
in  test  performance. 

If there are true subgroup differences, psychometric properties of
the tests may affect the size of the adverse effect. Highly unreliable
tests will have little adverse effect, for example. The problem can be
highlighted by looking at test characteristic curves. The true
differences in ability (as shown by the mean estimates of latent ability)
are not influenced by the choice of test, but observed differences are. A
test with a smaller slope on that curve will show less adverse effect
than will a test with a characteristic curve that is steeper. In other
words, even though the true differences are not changed by changing the
test, the observed differences may be markedly greater for one test than
for another, and both can err in opposite directions. One of the
tests may falsely exaggerate the true differences, while the other may
falsely minimize them.
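
A small numerical illustration may make the point about slopes concrete.
The sketch below is a simplification under stated assumptions: items
follow a two-parameter normal ogive with no guessing, each group is
represented only by its mean latent ability, and the item values and
group means are invented for the example. It compares the observed
difference in expected proportion correct on a test with a flat
characteristic curve and on one with a steep curve, for the same true
difference in ability.

```python
import math
from typing import List, Tuple


def normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_proportion(theta: float, items: List[Tuple[float, float]]) -> float:
    """Expected proportion correct at theta; items are (a, b) pairs."""
    return sum(normal_cdf(a * (theta - b)) for a, b in items) / len(items)


# Two hypothetical ten-item tests differing only in discrimination.
flat_test = [(0.5, 0.0)] * 10
steep_test = [(2.0, 0.0)] * 10

# Hypothetical group means on the latent ability scale; the true
# difference (1.0) is the same whichever test is used.
low_mean, high_mean = -0.5, 0.5

flat_gap = (expected_proportion(high_mean, flat_test)
            - expected_proportion(low_mean, flat_test))
steep_gap = (expected_proportion(high_mean, steep_test)
             - expected_proportion(low_mean, steep_test))
print(flat_gap, steep_gap)
```

The steeper test shows a much larger observed gap even though the latent
difference is identical, which is the sense in which the choice of test
can enlarge or shrink an observed difference.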

Working from item and test information curves, one can assemble
small sets of items yielding the most precise possible ability estimates
at different ranges of ability levels (Lord, 1968; Weiss, 1974). If
care has been taken to assure a full-range scale of ability in the
development of an item bank, with known item characteristic curves, then
any individual can be tested and located along that scale even using a
unique set of items. Once the individual is located on that scale, the
interpretation of his score is content referenced. For personnel testing,
tests can be tailored not only for individuals but also for individual
jobs requiring different levels of a particular ability, and standards
for each job can be defined in terms of the ability levels appropriate.
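
Assembling a small set of items for a particular ability range can be
sketched as a ranking of bank items by their information at a target
level. This is an illustration only, not the procedure of Lord (1968) or
Weiss (1974); it assumes a two-parameter normal ogive with no guessing,
and the item bank, its labels, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def normal_pdf(z: float) -> float:
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def item_information(theta: float, a: float, b: float) -> float:
    """Information of a two-parameter normal ogive item at theta."""
    p = normal_cdf(a * (theta - b))
    slope = a * normal_pdf(a * (theta - b))
    return slope * slope / (p * (1.0 - p))


def select_items(bank: Dict[str, Tuple[float, float]],
                 theta: float, n: int) -> List[str]:
    """Names of the n bank items most informative at theta."""
    ranked = sorted(bank,
                    key=lambda name: item_information(theta, *bank[name]),
                    reverse=True)
    return ranked[:n]


# Hypothetical bank: name -> (discrimination a, difficulty b).
bank = {
    "easy": (1.0, -2.0),
    "medium": (1.0, 0.0),
    "hard": (1.0, 2.0),
    "weak": (0.3, 0.0),
}

# A job requiring a high level of the ability draws on the harder items.
print(select_items(bank, theta=2.0, n=2))
# A job centered on average ability draws on the medium item first.
print(select_items(bank, theta=0.0, n=1))
```

The same ranking could be repeated at each job's required ability level,
giving each job a short test that is most precise near its own standard.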

Evaluation. Tests constructed using latent trait analysis can be
evaluated with conventional concerns for job relatedness, reliability, and
validity, but they may be better evaluated in other ways.

Job relatedness of work samples constructed by latent trait studies
is no different from job relatedness of other work samples. In either
case, it depends on the quality of the judgments made in defining the job
content universe and in moving logically from that definition to a set of
test specifications. Latent trait theory may, however, make it possible
to develop abbreviated work samples that will be equally job related by
identifying components that will maximize information at different levels
of proficiency.

In latent trait theory, classical reliability is replaced by the
idea of the information curve. Reliability coefficients can be manipulated
by manipulating samples (Samejima, 1977); they are not sample-free. The
standard error of measurement is a general statistic applying to all
examinees in a distribution (or, if specially computed, in a specified
broad range of the distribution). The standard error of estimate, however,
is a value describing the precision of measurement at a particular point
on the ability scale and is therefore far more informative. The test
information curve gives evaluative information similar to that provided
by reliability coefficients, but it does it better.




Construct  validity  is  less  important  in  latent  trait  studies  than  the 
fit  of  data  to  a model.  If  the  data  obtained  from  the  items  will  indeed 
fit  a latent  trait  model,  they  are  certainly  measuring  something  and 
doing  so  with  internal  consistency.  Item  construction  proceeds,  of  course, 
in  the  context  of  a particular  construct,  so  it  is  not  difficult  to 
define  the  underlying  trait  dimension.  Construct  validity,  if  of  interest, 
is  further  assured  if  biased  items  (or  items  with  other  evidence  of  poor 
fit)  are  eliminated  from  the  test  or  item  pool  as  potential  sources  of 
contamination.

In general, however, validity statements are superfluous. The
amount of research that goes into the development of such tests is indeed
substantial. When that research has been completed, and measurement is
expressed in terms of the underlying scale, that measurement is a
sufficiently satisfactory operational definition of the construct being
measured; no additional recourse to concepts of validity is necessary or
informative.

GENERALIZABILITY THEORY

Generalizability  theory  (Cronbach,  et  al.,  1972)  does  not  challenge 
the norm-referenced basis of classical psychometric theory; it is, in
fact,  an  extension  of  classical  theory.  The  challenge  it  poses  is  the 
challenge  to  the  undifferentiated  distribution  of  error  implicit  in  the 
classical  formulation  of  true  scores  and  error  scores  comprising  an 
obtained  score.  Moreover,  estimation  of  error  in  psychometric  theory  is 
built  on  the  requirement  of  parallel  tests,  a condition  not  regularly 
satisfied  in  psychological  measurement. 

Any observed score is based on measurement obtained under a specified
set of conditions. That set of conditions is but a sample of all of the
possible sets that might have existed. Recognizing this, Cronbach and
his associates ask investigators to define the universe of conditions,
or the universe of possible observations, under which a person might be
tested. One generalizes in any actual use of tests from the sample to a
universe of applicable conditions; generalizability studies make it
possible to define the limits of possible generalizability for any test, a
result particularly valuable in work sample testing.

An  illustration  of  this  implication  may  be  helpful.  Suppose  that  a 
work  sample  test  is  devised  for  measuring  a specified  skill  at  the  end  of 
training.  Suppose,  moreover,  that  the  test  is  administered  under  traditional 
ideas  of  good  test  administration:  good  lighting,  giving  instructions 
carefully  and  consistently,  special  efforts  to  ascertain  the  reliability 
of  observation,  and  a general  effort  to  provide  conditions  optimally 
suited  for  maximizing  performance  of  the  examinee  or  reliability  of  the 
observations. 

Now, no one is really interested specifically in how well the
individual performs at the end of training except possibly the trainers.
From an organizational point of view, the measurement of skill at the end
of training is intended to generalize to conditions less optimal but more
realistic, that is, to field rather than institutional settings.
Obviously, there can be many different kinds of field conditions.
Conditions can vary according to light sources, according to geographical
climate, or according to variations in degrees of situational hostility.

A generalizability study, or a multiple facet analysis as it is also
known, can be designed to determine the degree to which scores obtained
in a sample measured under optimal conditions can be generalized to the
different, non-optimal conditions of the study. Three possible kinds of
findings can emerge: one may find that the inferences generalize quite
well across conditions, one may find that they generalize not at all, or
one may find that they will generalize to a limited subset of conditions,
that is, that generalization across facets is possible only by the
deletion of certain conditions.




One  other  point,  too  important  for  the  possible  implications  of 
generalizability  theory  for  work  sample  testing  to  be  omitted  from  this 
brief  discussion,  is  that  the  method  permits  one  to  estimate  universe 
scores  or  expected  obtained  scores  under  specifiable  combinations  of 
facets. That is, even if there are substantial differences in
performance under different sets of conditions, one may be able to generalize
beyond  the  initial  condition  by  making  estimates  of  the  obtained  scores 
that  would  be  expected  under  specified  kinds  of  field  conditions. 
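
The simplest case of such a study, in which every person is observed
under every condition of a single facet, can be sketched numerically. The
sketch below uses the standard expected-mean-square estimates for a
crossed persons-by-conditions design; it is not taken from this report,
and the function names and the scores are hypothetical. The returned
variance components separate differences among persons from differences
among conditions and from residual error.

```python
from typing import List, Tuple


def variance_components(scores: List[List[float]]) -> Tuple[float, float, float]:
    """Estimate (person, condition, residual) variance components.

    scores[p][c] is the score of person p under condition c
    (fully crossed design, one observation per cell).
    """
    n_p, n_c = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_c)
    p_means = [sum(row) / n_c for row in scores]
    c_means = [sum(scores[p][c] for p in range(n_p)) / n_p for c in range(n_c)]
    ss_p = n_c * sum((m - grand) ** 2 for m in p_means)
    ss_c = n_p * sum((m - grand) ** 2 for m in c_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_p - ss_c
    ms_p = ss_p / (n_p - 1)
    ms_c = ss_c / (n_c - 1)
    ms_res = max(ss_res / ((n_p - 1) * (n_c - 1)), 0.0)
    # Negative estimates are set to zero, a common convention.
    var_p = max((ms_p - ms_res) / n_c, 0.0)
    var_c = max((ms_c - ms_res) / n_p, 0.0)
    return var_p, var_c, ms_res


def g_coefficient(var_p: float, var_res: float, n_conditions: int) -> float:
    """Generalizability coefficient for relative decisions.

    Assumes var_p + var_res / n_conditions is greater than zero.
    """
    return var_p / (var_p + var_res / n_conditions)


# Hypothetical scores: three trainees, each observed under an optimal
# condition (first column) and a field condition (second column).
scores = [[10.0, 8.0], [8.0, 6.0], [6.0, 4.0]]
var_p, var_c, var_res = variance_components(scores)
print(var_p, var_c, var_res, g_coefficient(var_p, var_res, n_conditions=2))
```

In this invented example the field condition lowers every score by the
same amount, so the condition component is large but the ordering of
persons is preserved and the coefficient is 1.0; the constant shift is
the kind of information that would support estimating expected scores
under a specified field condition.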

Program Evaluation. Alternatives to conventional validation
procedures include evaluations of total programs using personnel tests. The
use  of  assessment  centers,  in  particular,  has  led  to  a situation  in  which 
the  predictor  is  no  longer  a single  test  or  small  battery  but  the  outcome 
of  a complex  assessment  procedure  expressed  as  the  judgment  of  observers. 

A less formal version of the same kind of thing occurs in an employment
office where, instead of using a test and expectancy chart or cutting
score, a series of assessment devices will be selected depending on the
questions a decision-maker wishes to answer about a particular candidate
for a particular job. Different batteries of tests may be used, different
weights may be given to the same tests, and different questions may
be asked. The procedure is frequently called clinical judgment or
clinical prediction.

The total testing program, including judgments or decisions, can be
evaluated in such circumstances if a quasi-experimental design can be
used to compare performance effectiveness, work force stability, or
some other outcome in an organization using the program with that in
a different organization, reasonably well matched with the first, in
which the program is not in use.

In a sense, this is criterion-related validation of the final judgment.
It is, however, more in line with modern concerns for program
evaluation, and it is mentioned here as a potential stimulus to exploring
the literature on program evaluation for its possible implications in the
evaluation of personnel testing programs.

SUMMARY 

Personnel testing programs have traditionally been evaluated in
terms of the classical psychometric concepts of validity, particularly of
criterion-related validity. The habit is well entrenched. Both the
Standards (APA, et al., 1974) and the Principles for the Validation and
Use of Personnel Selection Procedures (Division of Industrial-Organizational
Psychology, 1975) give institutional support and encouragement to the
habit. It is not a bad habit, like smoking, hazardous to the user's
health and therefore to be broken; rather it is like eating, a habit to
be tempered with moderation. Classical notions of validity have been
valuable, but there are evaluative concepts that are more useful for
some uses.

One of the difficulties with classical notions of validity is that
there are too many of them and, in personnel testing, they have been
forced to fit into too many Procrustean beds. The basic notion of
validity as an evaluation of measurement has been stretched into something
called content validity and squeezed into something else called
criterion-related validity, neither of which refers to the quality or
meaningfulness of measurement per se. Only investigations of construct
validity provide useful insights into the meaning of measurement; what is
called content validity is really better understood as content-oriented
test development, and criterion-related validity is in reality the outcome
of a test of a hypothesis.

In personnel testing, criterion-related validity holds a place of
high honor. It is an established, useful approach for demonstrating
the relationship of performance on the test to performance on the job,
a phrase which, when abbreviated, becomes job relatedness.

Job relatedness, or job relevance, is the most important single
consideration in the evaluation of most personnel testing procedures,
whether the testing is used to predict future performance, certify
competency, evaluate performance, or validate some other variable.
Criterion-related validity is a good source of evidence for judging the
job relatedness of a test, but it is not the only one.

Equally important evidence of job relatedness is showing that the
test is an acceptable operational definition of important aspects of job
performance. Such a showing is based primarily on a thorough, rational
process of getting information about a job and using expert opinion in
defining domains, test specifications, and the relevance of individual
items within the test. Surely, such information is at least on a par with
evidence of criterion-related validity serendipitously found using a
test developed for a wide variety of general uses.

Another vitally important consideration in the evaluation of a test
is the meaningfulness of scores obtained through its use. Meaningfulness
can be established in part through the methods of establishing construct
validity or from the methods of test construction. A very specific kind
of meaning is derived through criterion-related studies. A quite different
but perhaps equally valuable source of meaning is the concept of a
latent trait.

In short, a score on a personnel test becomes meaningful in a
variety of ways. It is meaningful if it can be interpreted in terms of
a predicted level of future performance or of a probability of attaining
some stated level of performance. It is meaningful if it can be
interpreted as a proficiency measure on a sample of the actual job. It is
meaningful if it can be interpreted directly in terms of a standard
of performance or in terms of a scale reflecting the variable being measured
without reference to an idiosyncratic distribution obtained from an
available sample of people or, for that matter, of items. Its
meaningfulness is enhanced to whatever degree it can be expressed as a
score on a meaningful scale which retains that meaningfulness over a wide
variety of circumstances. A content-referenced interpretation is at
least as meaningful as a criterion-referenced interpretation (using the
term here in its unusual sense of a score interpreted in terms of an
external criterion variable). Thus methods of scaling or calibrating
tests (such as latent trait analysis) need to be given a priority at
least as high as that given to criterion-related validation in evaluating
the meaningfulness of scores.

Classical test theory also evaluates tests in terms of reliability,
meaning the freedom within a distribution of test scores from variance
due to random error. Classical notions of reliability do not take
systematic error into account. The application of the reliability concept
to the evaluation of a single score is through the standard error of
measurement, a value generally taken to be the same throughout the entire
distribution of scores.
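The classical standard error of measurement can be sketched in a few lines. The formula used, the standard deviation of observed scores multiplied by the square root of one minus the reliability coefficient, is the standard classical result; the function names and the 1.96 multiplier for an approximate 95 percent band are illustrative assumptions.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """Classical SEM: sd * sqrt(1 - reliability), one value for all score levels."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)

def score_band(score, sd, reliability, z=1.96):
    """Approximate band around an observed score, the same width at every score."""
    sem = standard_error_of_measurement(sd, reliability)
    return score - z * sem, score + z * sem
```

Because `sd` and `reliability` describe the whole distribution, the band has the same width for a low score as for a high one, which is the limitation noted in the paragraph above.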

These are also useful evaluations, but they, too, can be improved
upon through the use of newer ideas. Latent trait theory, for example,
replaces the reliability theme with the idea of the information curve,
using the standard error of estimated abilities as an index of precision
at specific ability levels. Generalizability theory offers a much more
comprehensive and useful accounting of various sources of error and
their magnitudes, and it permits statements of both the limits of
generalizability and the estimates of scores in different sets of conditions.
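The idea of an information curve can be made concrete with a short sketch. It assumes a two-parameter logistic item model, which this report does not specify; the item parameters `a` (discrimination) and `b` (difficulty) and all function names are illustrative.

```python
import math

def item_information(theta, a, b):
    """Information of one two-parameter logistic item at ability theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))  # probability of a correct response
    return a * a * p * (1.0 - p)

def total_information(theta, items):
    """Sum of item informations; items is a list of (a, b) pairs."""
    return sum(item_information(theta, a, b) for a, b in items)

def standard_error(theta, items):
    """Standard error of the ability estimate at theta: 1 / sqrt(information)."""
    return 1.0 / math.sqrt(total_information(theta, items))
```

Evaluating `standard_error` over a range of `theta` values traces how precision varies with ability level, in contrast to the single standard error of measurement of classical theory.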

Modern measurement theory, although it has offered challenges to
classical psychometric theory, has not reduced the usefulness of classical
evaluations, especially in situations such as the use of tests or
ratings in the measurement of such variables as attitude or personality
characteristics. For many other variables and for other methods of
measurement, however, personnel testing needs to explore and exploit
the possibilities of the newer theories. These possibilities are
particularly relevant to work sample testing because it is most appropriately
evaluated in terms of job relevance and its amenability to content-
referenced interpretations of scores.


REFERENCES 


American Psychological Association, American Educational Research
Association, and National Council on Measurement in Education.
Standards for educational and psychological tests. Washington,
D.C.: American Psychological Association, 1974.

Bejar, I. I. An application of the continuous response level model
to personality measurement. Applied Psychological Measurement,
1977, 1, 509-521.

Bejar, I. I., Weiss, D. J., & Gialluca, K. A. An information comparison
of conventional and adaptive tests in the measurement of
classroom achievement (Resch. Rep. 77-7). Minneapolis: University
of Minnesota, Psychometric Methods Program, 1977.

Cronbach, L. J. Test validation. In R. L. Thorndike (Ed.), Educational
measurement (2nd ed.). Washington, D.C.: American Council
on Education, 1971.

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. The
dependability of behavioral measurement. New York: Wiley, 1972.

Cronbach, L. J., & Meehl, P. E. Construct validity in psychological
tests. Psychological Bulletin, 1955, 52, 281-302.

Division of Industrial-Organizational Psychology. Principles for
the validation and use of personnel selection procedures. Dayton:
Author, 1975.

Ebel, R. L. Obtaining and reporting evidence on content validity.
Educational and Psychological Measurement, 1956, 16, 269-282.

Ebel, R. L. Must all tests be valid? American Psychologist, 1961,
16, 640-647.

Ebel, R. L. Content standard test scores. Educational and Psychological
Measurement, 1962, 22, 15-25.

Ebel, R. L. Comments on some problems of employment testing.
Personnel Psychology, 1977, 30, 55-63.

Grant, D. L., & Bray, D. W. Validation of employment tests for telephone
company installation and repair occupations. Journal of
Applied Psychology, 1970, 54, 7-14.

Guion, R. M. Personnel testing. New York: McGraw-Hill, 1965.


Guion, R. M. Recruiting, selection, and job placement. In M. D.
Dunnette (Ed.), Handbook of industrial and organizational
psychology. Chicago: Rand McNally, 1976.

Guion, R. M. Content validity -- the source of my discontent.
Applied Psychological Measurement, 1977, 1, 1-10.

Guion, R. M. Scoring of content domain samples: The problem of
fairness. Journal of Applied Psychology, in press.

Gulliksen, H. Intrinsic validity. American Psychologist, 1950, 5,
511-517.

Hambleton, R. K., & Cook, L. L. Latent trait models and their use in
the analysis of educational test data. Journal of Educational
Measurement, 1977, 14, 75-96.

Hambleton, R. K., Swaminathan, H., Algina, J., & Coulson, D. B.
Criterion-referenced testing and measurement: A review of
technical issues and developments. Review of Educational
Research, 1978, 48, 1-47.

Hull, C. L. Aptitude testing. Yonkers, N.Y.: World Book, 1928.

Ironson, G. H. A comparative study of several methods of assessing
item bias. Unpublished doctoral dissertation, University of
Wisconsin-Madison, 1977.

Lazarsfeld, P. F. The logical and mathematical foundation of latent
structure analysis. In S. A. Stouffer et al., Measurement and
prediction. New York: Wiley, 1950.

Linn, R. L. Issues of validity in measurement for competency-based
programs. Paper presented at the meeting of the National
Council on Measurement in Education, New York, April, 1977.

Lord, F. M. Some test theory for tailored testing (ETS RB-68-38).
Princeton, N.J.: Educational Testing Service, 1968.

Lord, F. M., & Novick, M. R. Statistical theories of mental test
scores. Reading, Mass.: Addison-Wesley, 1968.

Messick, S. The standard problem: Meaning and values in measurement
and evaluation. American Psychologist, 1975, 30, 955-966.

Rudner, L. M. Item and format bias and appropriateness. Washington,
D.C.: Model Secondary School for the Deaf, 1976.




Samejima, F. Estimation of latent ability using a response pattern
of graded scores. Psychometrika Monograph No. 17, 1969.

Samejima, F. A general model for free-response data. Psychometrika
Monograph No. 18, 1972.

Samejima, F. Homogeneous case of the continuous response model.
Psychometrika, 1973, 38, 203-219.

Samejima, F. A use of the information function in tailored testing.
Applied Psychological Measurement, 1977, 1, 233-247.

Schmidt, F. L., Hunter, J. E., & Urry, V. W. Statistical power in
criterion-related validation studies. Journal of Applied
Psychology, 1976, 61, 473-485.

Stanley, J. C. Reliability. In R. L. Thorndike (Ed.), Educational
measurement (2nd ed.). Washington, D.C.: American Council on
Education, 1971.

Tenopyr, M. L. Content-construct confusion. Personnel Psychology,
1977, 30, 47-54.

Weiss, D. J. Strategies of adaptive ability measurement (Resch.
Rep. 74-5). Minneapolis: University of Minnesota, Psychometric
Methods Program, 1974.

Wright, B. D. Sample-free test calibration and person measurement.
Proceedings of the 1967 Invitational Conference on testing problems.
Princeton, N.J.: Educational Testing Service, 1968.


